Efficiently Finding Higher-Order Mutantsckaestne/pdf/arxiv20hom.pdfrealistic faults or improving test optimization techniques. Despite interest in studying promising higher-order mutants,

Efficiently Finding Higher-Order MutantsChu-Pan Wong

Carnegie Mellon UniversityJens Meinicke

Carnegie Mellon UniversityLeo Chen

Carnegie Mellon University

João P. DinizFederal University of Minas Gerais

Christian KästnerCarnegie Mellon University

Eduardo FigueiredoFederal University of Minas Gerais

ABSTRACTHigher-order mutation has the potential for improving major draw-backs of traditional first-order mutation, such as by simulatingmorerealistic faults or improving test optimization techniques. Despiteinterest in studying promising higher-order mutants, such mutantsare difficult to find due to the exponential search space of mutationcombinations. State-of-the-art approaches rely on genetic search,which is often incomplete and expensive due to its stochastic nature.First, we propose a novel way of finding a complete set of higher-order mutants by using variational execution, a technique that can,in many cases, explore large search spaces completely and often ef-ficiently. Second, we use the identified complete set of higher-ordermutants to study their characteristics. Finally, we use the identifiedcharacteristics to design and evaluate a new search strategy, inde-pendent of variational execution, that is highly effective at findinghigher-order mutants even in large code bases.

KEYWORDSMutation Analysis, Higher-Order Mutants, Variational Execution

1 INTRODUCTIONMutation analysis has been studied for decades in software en-gineering research [61], and increasingly adopted in industry inrecent years [65, 67]. Mutation analysis has many applications, in-cluding assessing and improving test suite quality, generating orminimizing a test suite, or as a proxy for evaluating other researchtechniques such as fault localization [28, 61]. Traditionally, muta-tion analysis injects syntactic mutations into an existing programand runs the existing test suite to assess whether the test suite issensitive enough to detect the mutations.

Higher-order mutation is the idea of combining multiple muta-tions with the goal of representing more subtle changes, more com-plex changes, or changes that better mirror humanmistakes [23]. Tothat end, Jia and Harman [23] distinguish first-order mutants, con-sisting of a single change, from higher-order mutants that combinemultiple changes. While most research on mutation analysis hasfocused on first-order mutants, higher-order mutation is promis-ing: For example, recent studies claim that higher-order mutantsare less likely to be equivalent mutants [35, 50, 53, 62], and thathigher-order mutants can reduce test effort [21, 23, 68]. In Sec-tion 2, we discuss a specific use case of higher-order mutants witha motivating example.

A key challenge in adopting higher-order mutation is identifyingbeneficial higher-order mutants. Most higher-order mutants are aseasy to kill as their constituent first-order mutants, due to coupling.Jia and Harman [23] argue that only a subset of all possible com-binations better simulate real faults and increase subtlety of the

seeded faults. Specifically, Jia and Harman [21, 23] look for whatthey name a strongly subsuming higher-order mutant (SSHOM), a par-ticular kind of higher-order mutant which is harder to detect thanits constituent first-order mutants, as we will explain in Section 2.However, SSHOMs are tricky to find among the vast quantity ofpossible combinations of first-order mutants. Current approachesuse genetic-search techniques, guided by a simple fitness func-tion [17, 21, 23, 43]. Since SSHOMs are difficult to find, little isknown about them and their characteristics.

In this work, we develop a technique that can find a complete setof SSHOMs for small to medium-sized programs, which enables us tostudy characteristics of SSHOMs. Based on the identified character-istics, we then develop a new heuristic search technique that is light-weight, scalable, and practical. Overall, we proceed in three steps:

(1) Variational Search: For the purpose of studying SSHOM ina controlled setting, we develop a new search strategy searchvarthat allows us to find a complete set of higher-order mutants fora given test suite and given set of first-order mutants in smallto medium-sized programs. Specifically, we use variational execu-tion [54, 55, 79], a dynamic-analysis technique that jointly exploresmany similar executions of a program. Conceptually, our approachsearches for all possible higher-order mutants at the same time,identifying, with a propositional formula for each test case, whichmutants and combinations of mutants cause a test to fail. Fromthese formulas, we then encode search as a Boolean satisfiabilityproblem and use BDDs or SAT solvers to enumerate all SSHOMs. Acomplete exploration with variational execution is often feasiblefor small to medium-sized programs, because variational executionshares commonalities among repetitive executions and becausemodern SAT solving techniques are relatively fast. Though it doesnot scale to all programs, analyzing a complete set of SSHOMs forsmaller programs allows us to study SSHOMs more systematically.

(2) Complete-Mutant-Set Analysis:We study the character-istics of the identified higher-order mutants from Step 1. Whereprevious approaches found only few samples of higher-order mu-tants, we have a unique opportunity to study the characteristics ofhigher-order mutants on a complete set. We analyze characteristics,such as, the typical number of mutants combined and their distancein the code. This helps us better understand higher-order mutantswithout the potential sampling bias from a search heuristic. Forexample, we found that most SSHOMs are composed of fewer than4 first-order mutants and that constituent first-order mutants tendto locate within the same method or the same class.

(3) Prioritized Heuristic Search: Finally, we develop a secondnew search strategy searchpri that prioritizes likely promisingcombinations of first-order mutants based on the characteristicsidentified in Step 2. The searchpri is easy to implement and does

arX

iv:2

004.

0200

0v1

[cs

.SE

] 4

Apr

202

0

Originalbool f(int a, int b): if (a == 1): return a < b return a > b

Mutation 1 (FOM)bool f(int a, int b): if (a != 1): return a < b return a > b

Mutation 2 (FOM)bool f(int a, int b): if (a == 1): return a >= b return a > b

Both (HOM)bool f(int a, int b): if (a != 1): return a >= b return a > b

Test Orig. Mut. 1 Mut. 2 Both Failure Cond.

T1 : assert f(1, 2) ✓ ✗ ✗ ✗ m1 ∨m2T2 : assert !f(0, 3) ✓ ✗ ✓ ✓ m1 ∧ ¬m2T3 : assert !f(1, 1) ✓ ✓ ✗ ✓ ¬m1 ∧m2

Figure 1: Example of mutations with their test outcomes.

not require the heavyweight variational analysis of searchvar. Al-though it no longer provides completeness guarantees, it is highlyefficient at finding higher-order mutants fast and scales much bet-ter to larger systems with tens of thousands of first-order mutants.We evaluate the new search strategy using a different set of largersystems to avoid potential overfitting. Our results indicate that thepreviously identified characteristics are useful in guiding the search.Our new search strategy can find a large number of SSHOMs de-spite an exponentially large search space, whereas existing searchapproaches can barely find any.

We make the following contributions in this work:• We propose a novel way of using variational execution tofind a complete set of SSHOMs for small to medium-sized pro-grams, by formalizing the search as a Boolean satisfiabilityproblem. An evaluation shows that we can achieve complete-ness and simultaneously increase efficiency. (Section 3)• Using a complete set of SSHOMs, we make a first step instudying basic characteristics of SSHOMs with the goal toinform future research. (Section 4)• To show how useful the characteristics are, we use them todesign a new lightweight prioritized search strategy, inde-pendent of variational execution. We evaluate the prioritizedsearch strategy on a fresh set of larger benchmarks, showingthat the new search is scalable and generalizable. (Section 5)

2 HIGHER-ORDER MUTANTSMutation analysis introduces a set of syntactic changes to a softwareartifact and observe whether the previously passing test suite issensitive enough to detect the changes (termed “to kill the mutant”).Traditionally, many simple small changes are explored in isolation,one at a time; several catalogues of mutation operators that performsmall syntactic changes exist [34, 61].

In its simplest form, higher-order mutants are combinations oftwo or more first-order mutants [21, 23]. The set of possible second-order mutants grows quadratically with the size of the set of first-order mutants from which they are combined; if considering com-bining more than two first-order mutants, the set of possible higher-order mutants grows much faster.

Many higher-order mutants are of little value in practice, becausea test that would kill any constituent first-order mutant will likelyalso kill the higher-order mutant, discussed as the coupling effecthypothesis [56]. However, Jia and Harman [23] show that thereexist several classes of higher-order mutants that are potentially

valuable, because they exhibit interesting behavior. They specifi-cally highlight strongly subsuming higher-order mutant (SSHOM), inwhich the constituent mutants interact in ways making the higher-order mutant hard to kill, as we will explain in detail in Section 2.2.

2.1 Usefulness of Higher-Order MutantsA recent survey of over 39 papers on higher-order mutation test-ing [13] summarized a large number of different application scenar-ios for higher-order mutants claimed in prior research, includingmutant reduction [12, 17, 19], coupling effect analysis [14, 23], equiv-alent mutant reduction [36, 50], test data evaluation [16], and testsuite reduction [17, 53]. In the following, we illustrate a concreteexample of how higher-order mutations can be useful to software-engineering researchers for creating synthetic, but challengingfaults to evaluate various software engineering tools.

The effectiveness of many approaches in software-engineeringresearch needs to be evaluated on faults in software systems. Forexample, fault localization tools need to evaluate how accuratelythey can localize the faults, test suite generation tools need toevaluate how effective the generated tests are at finding bugs, andprogram repair tools need to evaluate how many faults they canrepair. When evaluating their tools, researchers often have thechoice of running evaluations on a curated, often small, set of realbugs or running on large numbers of synthetically seeded bugs.Both approaches have known benefits and drawbacks:• Seeded faults are convenient: Easy to create and providinga perfect ground truth, they allow researchers to run ex-periments with very large numbers of faults on almost anysystem. For example, fault localization techniques were of-ten evaluated on artificially seeded single-edit faults, suchas those in the Siemens test suite [18] (e.g., [1, 25, 47, 64, 70]).Researchers have been critical of this style of evaluation,arguing that seeded single-edit faults are not representativeof most real faults (which often require fixes in multiple lo-cations) [28, 80] and that that fault localization techniquesmay not generalize as they are over-optimized in findingsuch simple single-edit faults [64].• In contrast, if curated well, datasets of real faults can bemuchmore representative of realistic usage scenarios. Research onautomated program repair is almost exclusively evaluated ona few dozen to a few hundred real faults [46]. For example,the widely used Defects4J dataset [27] curated 438 faultswith corresponding failing tests from 5 libraries. Creatinghigh-quality datasets of realistic and representative faultsis challenging and typically requires significant human andengineering effort [27, 48, 49, 74]. Therefore, while it is easyto seed millions of faults in almost any program, only fewdatasets of curated real faults are available, often only withmoderate numbers of faults in a small number of librariesor programs. Some researchers warn that overly focusingon few datasets of faults, such as Defects4J, leads to repairapproaches that often overfit the available faults [11, 74].

In this tension between simple seeded faults and expensive tocurate real faults, higher-order mutationmay provide a compromise.Certain kinds of higher-order mutants, in particular SSHOMs thatwe study in this work, are more subtle and hard to kill (shown

bool f(int a, int b):if (a != 1):

return a < breturn a > b

(a) Mutation 1

bool f(int a, int b):if (a == 1):

return a >= breturn a > b

(b) Mutation 2

bool f(int a, int b):if (a != 1):

return a >= breturn a > b

(c) HOM

Figure 2: Suspicious lines based on coverage ranking usingspectrum-based fault localization [25]. Ranking is shown asintensity of danger , suspicious , caution and safe .

both theoretically [14] and empirically [15, 21, 23, 43, 60]). They aremore promising to simulate real faults than traditional first-ordermutants: For example, Zhong and Su [80] and Just et al. [28] foundthat more than 70%, respectively 50%, of real faults are causedby faults in more than two locations. Just et al. [28] also foundthat 73% of real faults are coupled to mutants, while on average2 mutants are coupled to a single real fault. That is, certain kindsof higher-order mutants may be more representative of real faults.Thus, assuming we can find them efficiently, which is the goal ofthis paper, we can still automate their creation and seed thousandsof these more challenging faults in almost any software systems.

Let us illustrate the potential of higher-order mutation for faultlocalization with a small example with 3 existing test cases in Fig-ure 1. The program is mutated into two first-order mutants, whichare later combined to form a higher-order mutant, with test resultsfor each mutant reported in the figure. Note how this higher-ordermutant fails for fewer test cases than the constituent first-ordermutants. In this simple setting, the classic fault localization tech-nique Tarantula [25] works quite well for the first-order mutants,highlighting the mutated lines as shown in Figure 2; but for thehigher-order mutant, Tarantula fails to report the twomutated lines,but instead marking the unchanged line as dangerous. This exampleshows how fault localization fails to locate the faulty lines if the mu-tations are interacting with each other, which, as discussed, may beexpected for realistic faults [28, 80]. As a further consequence, a pro-gram repair technique based on spectrum-based fault localizationmay not even attempt to fix the first return statement [45].

To realize the full potential of higher order mutants for theseand other use cases, it is critical to have an efficient way of findinginteresting higher-order mutants. In this work, we do not reevaluatethe usefulness of HOMs for various use cases [13] or how well theyrepresent real faults [23, 28, 80], which has been studied repeatedlyand comprehensively in prior work [10]. Instead, we focus on atechnical problem that made SSHOMs too costly and impractical:How to efficiently find SSHOMs (and for part of our research alsohow to find all SSHOMs in small to medium-sized programs so thatwe can study their characteristics).

2.2 Strongly Subsuming Higher-Order Mutants(SSHOMs)

Jia and Harman [23] classify higher-order mutants into severalkinds, specifically highlighting SSHOMs as useful. For this reason,our work targets SSHOMs, though we expect that it can be gener-alized to other classes of higher-order mutants. Specifically, Jia andHarman [23] define a SSHOM as a higher-order mutant that can

only be killed by a subset of test cases that kill all its constituentfirst order mutants. More formally, let h be a higher order mutantcomposed of first-order mutants f1, f2, . . . , fn , Th the set of testcases that kill the higher-order mutant h, andTi the set of test casesthat kill the first-order mutant fi , then h is a SSHOM if and only if:

Th , ∅ ∧ Th ⊆⋂

i ∈1...nTi (1)

If we further restrict Th to be a strict subset, we get a evenstronger type of SSHOM, which we call strict strongly subsuminghigher order mutant, denoted as strict-SSHOM.1 In other words,there must be at least one test case that kills a first-order mutant,but not the higher-order mutant. Thus, in a strict-SSHOM, multiplefirst-order mutants interact such that they mask each other at leastfor some test cases, making the strict-SSHOM harder to kill thanall the constituent first-order mutants together.

Our (manually constructed) SSHOM in Figure 1 illustrates thisrelation: Intuitively, the first first-order mutant (replacing ‘==’ by‘!=’) forces the execution to go into an unexpected branch, and thesecond (replacing ‘<’ by ‘>=’) inverts the return values. The twochanges in control and data flow are easy to detect separately (i.e.,killed by two test cases each), but the combination of them is moresubtle and only detected by one test case.

2.3 Finding SSHOMsAn SSHOM is defined in terms of subset relation among mutantskilled by a set of test cases. For a given set of first-order mutants, thesearch space is finite, though very large due to the combinatorialexplosion. Since only few of the combinations are interesting andthose are hard to find in vast search spaces, higher-order mutationtesting has long been considered too expensive.

Jia and Harman [23] explored search techniques to find SSHOMs,finding that genetic search performs best. We will use their genetic-search strategy, together with a brute-force strategy, as baselinesfor our evaluations. Although genetic search has been shown tosuccessfully find SSHOMs, it requires considerable resources to eval-uate many candidates, involves significant randomness, and cannotgive guarantees of completeness, e.g., establish that no SSHOMexists or enumerate them all.

All existing techniques for finding higher-order mutants (includ-ing this work) require executing all first-order mutants with a fixedtest suite as part of constructing higher-order mutants, as we willdiscuss in Section 3. As such, our work is less appealing to thetraditional mutation testing use case of evaluating test suite ade-quacy. However, we argue that finding SSHOMs is still valuable as aresearch tool for fault localization and program repair, as discussedin Section 2.1. In line with our work, dominant mutants have beenmotivated and investigated as a research tool to improve mutationtesting research, which we discuss in Section 7.

1SSHOMs have been defined inconsistently in the literature as subset [17] and strictsubset [21, 23]. We inherit the definition of SSHOMs from Harman et al. [17], as itis the most recent work. As we will see in the evaluation, the difference betweensubset and strict subset is significant, so we make the distinction explicit, introducingstrict-SSHOM as a distinct subclass and reporting results on both.

3 STEP 1: COMPLETE SEARCHWITHVARIATIONAL EXECUTION (searchvar)

In this step, we develop searchvar to compute a complete set ofSSHOMs so that we can study the properties of SSHOMs.

First, given a program under analysis, we generate all first-ordermutants upfront by applying our mutation operators exhaustivelyat every applicable location. We represent each mutant as a Booleanoption and use a ternary conditional operator to encode the change.For example, in the code snippet below, we show how we encodethe two first-order mutants from Figure 1.bool f(int a, int b):

if ( m1 ? a != 1 : a == 1 ):

return m2 ? a >= b : a < b

return a > b

After encoding first-order mutants, we use variational execu-tion as a black-box technique to explore which test cases fail underwhich combinations of first-order mutants. In a nutshell, variationalexecution runs the program under analysis by dynamically trackingthe differences caused by options (similar to executing the programsymbolically with symbolic values for all mutations) [3, 54, 55, 79].Conceptually, a single run of variational execution with optionsis equivalent to running all combinations of options sequentially,but it is usually much faster due to sharing of similar executionsat runtime [54, 55, 79]. For a given test execution, variational exe-cution will return a propositional formula representing exactly thecombinations of options for which the test fails, which we illustratefor our running example in Figure 1 (last table column).

Finally, we collect all propositional failing conditions for all testcases and use them to search for SSHOMs by encoding the searchas a Boolean satisfiability problem. Using BDDs or SAT solvers, wecan then enumerate all solutions, which correspond directly to allSSHOMs. Although the formulas can be large if we have many first-order mutants and test cases and finding satisfiable assignments isNP-hard, modern SAT solving techniques are scalable enough withsmall to medium-sized systems. Our implementation is availableonline. 2

3.1 Mutant GenerationWe represent each first-order mutant with a Boolean option (globalstatic field in Java) and encode the pending change with a ternaryconditional operator. We encode all first-order mutants all at onceinto the program to generate a metaprogram, which is used in ourlater steps for finding SSHOMs. This compact encoding of mutantsdefines a finite set of first-order mutants, which is critical for vari-ational execution to be efficient [78]. Similar encodings have beenexplored in the past in different contexts, such as speeding up mu-tation testing [29, 51, 75]. Using this encoding, we also ensure afair comparison with baseline approaches by excluding compilationtime and using the same metaprograms.

For our experiments, we implemented 3 mutation operators:(1) Arithmetic Operator Replacement (AOR, mutating +, -, *, /, %)(2) Relational Operator Replacement (ROR, mutating ==, !=, <, >, <=,>=) and (3) Logical Connector Replacement (LCR, mutating || and&&). These comprise three of five most useful mutation operators

2https://figshare.com/s/182142e4e7dc3b5981ff

according to Offutt et al. [57, 58], excluding two further based onrecent insights: (4)Absolute Value Insertion (ABS) has been shown tobe less useful in practice [66], so we excluded it to avoid a meaning-lessly large search space. (5) Unary Operator Insertion (UOI) wouldadd many more mutants, most of which are likely equivalent to theones generated from other mutation operators (e.g., mutating a+bto a+-b using UOI is equivalent to a-b using AOR) [34, 51, 66].

We argue that our selection of mutation operators is sufficientas the first step in studying properties of SSHOMs. A recent studyby Kurtz et al. [39] suggests that mutation operators should becarefully chosen for individual programs to maximize the benefitsof mutation testing, while our mutation operators remain a rea-sonable choice across programs. With the goal to uncover generalcharacteristics of SSHOMs, we decided to use this small but wellstudied set of mutation operators.

We generate all possible mutations exhaustively at every applica-ble location in the source code. To apply multiple mutation opera-tors to the same expression, we nest ternary conditional operators.

3.2 Variational ExecutionWe use variational execution to determine which combinations ofmutants fail a test case. The novelty of using variational execu-tion lies in the efficient and complete exploration of all mutants,as opposed to one mutant at a time in traditional search-basedapproaches. For this work, the details of how variational executionworks are not relevant, and we use it as a black-box technique. Here,we only provide an intuition and refer the interested readers toexisting literature for a more in-depth discussion [3, 54, 55, 79].

Variational execution performs computations with conditionalvalues [79], which may represent multiple alternative concrete val-ues. For example, a conditional value <α, 1, -1> indicates that xhas the value 1 if α , and -1 otherwise; conditional values can rep-resent a finite number of alternative concrete values distinguishedby propositional conditions over symbolic options. Variational exe-cution then computes with conditional values and propagates themalong data and control flow, possibly under symbolic path condi-tions. In a nutshell, variational execution can be considered as anextreme design choice among various forms of symbolic programevaluation [5, 6, 8, 33, 72] for finite domains, in which computa-tions are maximally performed on concrete values, but booleansymbolic values may distinguish between multiple concrete valuesper variables [3, 54, 79].

For our purposes, we consider all variables representing first-order mutants as symbolic options (technically as a conditionalvalue <mi, true, false>). This way, all state changes causedby mutants can be compactly tracked, which enables us to exploreall combinations of mutants at the same time. As output, we de-termine under which combinations of mutants a test case fails(propositional formula over first-order mutants as illustrated inFig. 1), by simply observing under which condition any assertedexpression evaluates to false.

In theory, mutant interactions can cause a combinatorial explo-sion in conditional values where an exponentially many alternativevalues for different combinations of mutants need to be tracked fora single variable. However, in practice not all mutants affect eachtest and not all mutants interact, enabling often reasonably efficient

https://figshare.com/s/182142e4e7dc3b5981ff

exploration of all feasible combinations. We defer the discussion ofthis scalability issue to Section 3.4.

Multiple different implementations of variational execution existfor a number of programming languages [3, 4, 31, 54, 55, 71, 79].We use VarexC, a state-of-the-art implementation of variationalexecution for Java, based on bytecode transformation [79]. Forthis work, we extended VarexC to deal with infinite loops that arecaused by some mutations. Existing mutation-testing techniquesoften detect infinite loops by setting timeout on test cases, butgeneralizes poorly to approaches that exploremultiple branches andtrack alternative values. Instead, we count how many basic blockshave been executed and terminate execution in path conditionswhere a threshold is reached (10 million in our experiments, basedon observations of the test cases).

3.3 SSHOMs Search as a Boolean SatisfiabilityProblem

We use the output of variational execution—propositional formu-las indicating under which combinations of mutations each testfails—to construct a single formula that is satisfiable exactly forthose assignments that represent SSHOMs, based on our definitionof SSHOM in Section 2. This way, the search for SSHOMs is trans-formed into a Boolean satisfiability problem, which we can solvewith BDDs or SAT solvers. To derive the formula, we outline thecriteria for identifying SSHOMs as defined by Jia and Harman [23](see Sec. 2.2) and construct a logical expression for each criterion.

LetT be the set of all tests,M be the set of all first order mutants,and ft be the propositional formula over literals fromM describingthe mutant configurations in which test t ∈ T fails (f is generatedwith variational execution, see above). As shorthand, let Γ(m, t) bethe result of evaluating ft with first-order mutantm assigned totrue and all other mutants assigned to false; in other words, whethertest t fails for first-order mutantm. To identify SSHOMs, we encodethree criteria:

(1) The SSHOM must fail at least one test (i.e., must not be anequivalent mutant): ∨

t ∈Tft (2)

This check ensures that a mutant combination is killed by at leastone test, encoding Th , ∅ in Formula 1 (Sec. 2.2).

(2) Every test that fails the SSHOM must fail each constituentfirst order mutant:∧

t ∈T(ft ⇒

∧m∈M(¬m ∨ Γ(m, t)) (3)

If a given mutant combination (i.e., higher-order mutant) is killed bya test t , the same test must kill each constituent first-order mutant.That is, for all tests and first-order mutants, the first-order mutantmust either be killed by the test (Γ(m, t)) or not be part of thehigher-order mutant (¬m). This is the encoding ofTh ⊆

⋂i ∈1...n Ti

in Equation 1 (Sec. 2.2).In addition, we can optimize for SSHOMs that are harder to kill

than the constituent first order mutants, excluding those that areequally difficult to kill [23]. As discussed in Section 2.2, we call thesestrict-SSHOM and require a strict subset relation in Equation 1 (i.e.,

Th ⊂⋂i ∈1...n Ti rather thanTh ⊆

⋂i ∈1...n Ti ), which requires the

additional encoded condition:

(3) There exists a test that can kill all constituent first-ordermutants but cannot kill the strict-SSHOM.∨

t ∈T

(¬ft ∧

∧m∈M(¬m ∨ Γ(m, t))

)(4)

To find SSHOMs and strict-SSHOMs, we take the conjunction ofEquations 2–3 and 2–4, respectively, and use BDD or SAT solverto iterate over all possible solutions. For example, if our approachreturns a satisfiable assignment in whichm1 andm3 are selectedand all other mutants are deselected, then the combination ofm1andm3 is a valid (strict-)SSHOM.

We use BDDs to get satisfiable solutions by default, as VarexCuses BDDs internally to represent propositional formulas. Whileconstructing BDDs can be expensive during variational execution,getting a solution from a BDD is O(n), where n represents the num-ber of Boolean variables [7]. In some rare cases where we cannotcompute a BDD due to issues like insufficient memory, we fall backto using a SAT solver. With a SAT solver, we ask for one possiblesolution, then add the negation of that solution as an additionalconstraint before asking for the next solution, repeating the processuntil all solutions are enumerated. We can usually efficiently enu-merate all possible SSHOMs for the given set of first-order mutantsand the variational-execution result of a given test suite.

3.4 LimitationsWhile variational execution and the SAT encoding provide a newstrategy to enumerate all SSHOMs, this approach comes also withsevere restrictions, mostly regarding scalability and engineeringlimitations inherited from the tools we use, which limits broadapplicability in practice (which we address with an alternativestrategy in Sec. 5).Combinatorial Explosion. Recent studies show that combinato-rial explosion is uncommon for the types of highly-configurableprograms analyzed with variational execution in the past [54, 69],mainly because programs are usually written by human developersto have manageable interactions among options. When applied tohigher-order mutation testing, we did observe some combinatorialexplosion caused by random combinations of first-order mutants.For example, we observed cases where interactions of first-ordermutants create more than 15, 000 alternative concrete values in onesingle local variable. We argue that this is the essential complexityof the mutated program, and it would be equally difficult for otherapproaches to exhaustively explore a complex search space like this.In fact, a recent similar approach that uses SMT solver to detectequivalent mutants has similar scalability issues [42]. However, itis possible to find efficient search strategies when giving up thecompleteness goal, as we will show in Sec. 5.

In the evaluation of searchvar, we manually removed someproblematic first-order mutants and test cases that caused excessivenumber of interactions that exceeded our memory limits (12GB).For fairness, we remove these mutants and test cases across allcompared approaches.Environment Barrier . As other forms of symbolic evaluation,variational execution needs to deal with the environment barrier

carefully when execution interacts with an external runtime en-vironment that is not aware of conditional values or variabilitycontexts. This barrier often manifests as I/O or native method calls.There are several common strategies to mitigate this issue, such ascreating models for these operations [8, 72, 76, 79]. In our study,only few test cases and mutants triggered problematic environmentinteractions. While solvable with more engineering effort, we con-sider them noncritical for our goal and removed the problematictests or mutants after manual inspection.

3.5 EvaluationIn addition to using searchvar to get a complete set of SSHOMs,we compare efficiency and effectiveness of searchvar against theexisting state-of-the-art genetic search (searchgen) and a baselinebrute-force strategy (searchbf), based on subject systems previ-ously used in evaluating the genetic search strategy [17].Subject Systems. We replicate the setup of the most rigorous andlargest previous study on higher-order mutation testing [17]. Whilewe cannot perform an exact replication, since we could not ob-tain the original tools from the authors, not all relevant detailsand parameters have been published, and some engineering limi-tations discussed earlier, we still select the same subject systemsand reimplement mutation operators and search strategies in ourown infrastructure. That is, our results cannot be compared directlyagainst the numbers reported in prior work [17], but we reportcomparable numbers within a consistent setup.

We use the same four small to medium-sized Java programs,Monopoly, Cli, Chess, and Validator, all of which come with goodquality test suites that are deemed complete by developers [17].In addition, we use the triangle program commonly used in mu-tation testing [23]. Statistics of our subject systems are shown inTable 1 (top). In each subject system, we applied all our mutationoperators in all feasible locations, yielding the reported number offirst-order mutants; as discussed in Section 3.4, we had to excludesome mutants and test cases due to engineering limitations.Baseline Search Strategies. We compare our approach againstthe state-of-the-art genetic algorithm [17, 21, 23] and a naive brute-force search. The brute-force search iterates over all possible combi-nations of first-order mutants, starting from all pairs, then all triples,and so on until a time limit is reached. The brute-force search servesas a reliable baseline as there is no randomness involved and thesearch is easy to implement.

We reimplemented the genetic algorithm approach based on thedescription in Jia et al.’s work [20, 21, 23]. As the exact setup wasnot available or documented, we leave undocumented parametersat default values. The core of the genetic algorithm is a fitnessfunction for candidate higher-order mutants. Following existingwork [20, 21, 23] and using the notations in Equation 1, we calculatethe fitness as |Th |

|⋂i∈1. . .n Ti |.3 The intuition is that a SSHOM should

fail only for a subset of test cases that kill all its constituent first-order mutants. Thus, we use it as a piece-wise function: a fitnessof (0, 1] indicates a SSHOM and (0, 1) a strict-SSHOM, with lowerfitness more preferable; a fitness of 0 and larger than 1 indicate

3The fitness function has been defined either using intersect ofTi [20] or union [21, 23].We use the former in our reimplementation as it more precisely captures our intuitionof SSHOMs.

SSHOM Strict-SSHOM

Valid

ator†

10s 1m 10m 1h 12h

050

000

10s 1m 10m 1h 12h

028

1

Chess‡

10s 1m 10m 1h 12h

016

403

10s 1m 10m 1h 12h

021

6

Mon

opoly

10s 1m 10m 1h 12h

081

8

10s 1m 10m 1h 12h

043

CLI

10s 1m 10m 1h 12h

037

6

10s 1m 10m 1h 12h

021

Triang

le10s 1m 10m 1h 12h

096

5

10s 1m 10m 1h 12h

06

searchvar searchgensearchbf searchpri

†We cap the plot for Validator since there are 13.4 billion SSHOMs; ‡ we could notenumerate all nonstrict-SSHOMs for Chess due to the difficulty of the SAT problem

and report only those found within the time limit

Figure 3: (Strict-)SSHOMs found over time in each subjectsystem, averaged over 3 executions. Note that time is plottedin log scale as most SSHOMs are found within the first hour.

potential equivalent mutants and non-SSHOMs, respectively, whichare discarded between generations of the genetic algorithm.Measurements. All experiments were performed on AWS EC2instances, each of which has an Intel 4-core Xeon CPUwith 16GB ofRAM. We ran benchmarks to confirm that the performance is stableenough for our measurements across different instances (especiallygiven that we often demonstrate order-of-magnitude differencesin outcomes, which are unlikely to stem from measurement noise).For each search strategy (i.e., genetic algorithm, brute force, and ourvariational execution approach), we measure each subject systemthree times and report the average, like the three restarts in thework of Harman et al. [17]. We ran each trial of genetic algorithmand brute force for 12 hours.Results. In Table 1, we report the number of (strict-) SSHOMsfound with all three search strategies within the 12-hour time bud-get and in Figure 3, we plot the numbers of found (strict-)SSHOMsover time. Note that by construction, if searchvar terminates (allcases except Chess, where solving the satisfiability problem takesconsiderable time), it enumerates all SSHOMs, thus provides anupper bound for other search strategies—without searchvar thisupper bound would not be known.

These results show clear trends: searchvar requires a relativelylong time to find the first SSHOM, because variational execution

Table 1: Subjects and Found (strict-)SSHOMs; the last three subjects and the priority strategy are discussed in Section 5.

Found SSHOM Found strict-SSHOM

Subject LOC Tests (%used) FOMs (%used) Var Gen BF Pri Var Gen BF Pri

Validator 7,563 302 (83%) 1941 (97%) 1.34 * 1010 4,041 273 36,995 281 0 4 10Chess 4,754 847 (84%) 956 (26%) 3268† 484 19 16,403 216 0 6 24Monopoly 4,173 99 (89%) 366 (90%) 818 81 349 817 43 4 15 43Cli 1,585 149 (95%) 249 (51%) 376 309 326 369 21 18 21 21Triangle 19 26 (100%) 128 (100%) 965 949 493 965 6 6 6 6

Ant 108,622 1354 (77%) 18,280 (92%) - 1 0 44,496 - 0 0 61Math 104,506 5177 (79%) 103,663 (100%) - 0 0 390,533 - 0 0 2,830JFreeChart 90,481 2169 (99%) 36,307 (99%) - 0 6 576,725 - 0 0 513

LOC represents lines of code, excluding test code, measured with sloccount. Tests and FOMs report the numbers of test cases and first-order mutants we used in experiments, withthe percentages relative to the total numbers in parentheses. Var, Gen, BF, Pri denote our approach (Step 1, searchvar), the genetic algorithm (searchgen), brute force (searchbf),

and our prioritized search (Step 3, searchpri) respectively.† incomplete results, solutions found with SAT solving within the 12 hours budget.

must finish executing all tests for all combinations of first-order mu-tants. However, once variational execution finishes, it can enumer-ate all SSHOMs very quickly by solving the Boolean satisfiabilityproblem. Variational execution takes longer with more and longertest cases and with more first-order mutants, but still outperformsa brute-force execution by far, indicating significant sharing, asfound in prior analyses of highly-configurable systems [54, 55, 79].

In contrast, searchgen and searchbf can test many candidateSSHOMs before variational execution terminates and finds someactual SSHOMs early, but both approaches take a long time to find asubstantial number of SSHOMs and miss at least some SSHOMs inall subject systemwithin the 12h time budget given. In some systemswith moderate numbers of first-order mutants, searchbf is fairlyeffective as it systematically prioritizes pair-wise combinationswhich are more common among SSHOMs than combinations ofmore than two mutants, as we will discuss.

In summary, for systems where variational execution scales,searchvar can find all SSHOMs whereas other approaches findonly an often much smaller subset within a 12h time window.Whereas prior approaches often find their first SSHOMs faster,searchvar needs more time upfront for variational execution butcan then enumerate SSHOMs very quickly. To scale searchvar tomore realistic programs, more engineering is needed to overcomethe limitations discussed in Sec. 3.4. Nevertheless, searchvar isvaluable to the research community as it provides a precise andefficient way of identifying all SSHOMs.

4 STEP 2: SSHOM CHARACTERISTICSIn a second step, we study the characteristics of (strict-) SSHOMs,with the goal to inform subsequent heuristic search strategies(Step 3) and future research in general. Using the complete setderived for the subject systems in the previous step, rather than a(potentially biased) sample of SSHOMs, we can study characteristicswith higher confidence.

We explored the dataset in an iterative exploratory fashion, fo-cusing primarily on characteristics that may guide future searchstrategies, such as specific composition patterns and proximity of

constituent first-order mutants for the set of all higher-order mu-tants. Kurtz et al. [39] argue that mutation operators should bespecialized for individual programs, so we focus on high-level char-acteristics that are largely independent of specific mutation opera-tors to avoid overfitting. We started by randomly sampling a largenumber of identified SSHOMs (among the pool of all SSHOMs). Wemanually inspected the sampled SSHOMs to pose hypotheses aboutcommon characteristics. We then operationalized the hypothesizedcharacteristics (i.e., develop measures to apply across all SSHOMs)to quantitatively validate them. We repeated the process until wecould not identify additional hypotheses. Due to space constraints,we only report characteristics for which we could quantitativelyidentify strong support.Mutation Order . SSHOMs and strict-SSHOMs are typically com-posed of only few first-order mutants. Overall, over 90% of allSSHOMs and strict-SSHOMs are composed of at most 4 first-ordermutants, indicating that subtle interactions are mostly caused byfew first-order mutants. Although few SSHOMs were composedof up to 6 first-order mutants (in Chess and Triangle), such casesare rare, especially for strict-SSHOMs. We plot the distribution oforders for both SSHOMs and strict-SSHOMs in Table 2.Equivalent Test Failures. In multiple subject systems, manySSHOMs and strict-SSHOMs are composed of first-order mutantsthat are killed by the exact same set of test cases (nonstrict-SSHOMsare often killed by the same test cases, whereas strict-SSHOMs nec-essarily are killed by fewer). In Table 2, we report how many ofthe SSHOMs and strict-SSHOMs in each project could be foundwhen only combining first-order mutants that are killed by theexact same test cases, which we name Equal-Fail SSHOMs.Containment Relationships. In addition, we found a commoncontainment pattern: when a (strict-)SSHOM is composed of morethan two first-order mutants, it is very likely that a subset of thesefirst-order mutants also form a (strict-)SSHOM. In other words, anN+1 Rule, combining a previously identified (strict-)SSHOM withone further first-order mutant is a promising strategy to identifymore (strict-)SSHOMs. In Table 2, we report how many of the

Table 2: Characteristics of SSHOMs and strict-SSHOMs found in our subject systems.

Order Equal-Fail Rule N+1 Rule Distribution

Subject SSHOM strict-SSHOM SSHOM strict-SSHOM SSHOM strict-SSHOM SSHOM strict-SSHOM

Validator† - 2 4 - 96% - 99% - M 2C

Chess† - 2 - 76% - 38% - M C

Monopoly 2 3 4 5 2 3 11% 100% 99% 100% M C 2C M

Cli 2 3 4 2 3 4 53% 5% 98% 100% M C 2C * M C 2C

Triangle 2 3 4 5 6 2 3 4 5 8% 17% 98% 50% M M

Order counts number of constituent first-order mutants; equal-fail and N+1 rule explained in text; distribution: all constituent first-order mutants in same method (M), multiplemethods in the same class (C), two classes (2C), or spread across more than two classes (*).

† for Validator and Chess we omit statistics, because we cannot enumerate all possible SSHOMs (too many in Validator and incomplete set in Chess)

(strict-)SSHOMs in each project with more than two constituentfirst-order mutants could be generated with such a rule.Proximity. Finally, for most SSHOMs, all constituent first-ordermutants are in the same class and often even in the same method,likely because first-order mutants with close proximity have higherchances of data-flow or control-flow interactions. The effect iseven more pronounced for strict-SSHOMs. This stronger effectwas previously conjectured though not validated [23]. We plot thedistributions for all subject systems in Table 2.Other . We also explored other patterns that may inform searchheuristics, such as common combinations of mutation operators(using frequent-itemset mining [2]), but found no additional strongpatterns. While we believe a qualitative analysis of the mutants andtheir characteristics may reveal interesting insights about SSHOMsand whether they more closely mirror realistic human-made faults,such analysis goes beyond our scope of finding SSHOMs efficiently.

5 STEP 3: CHARACTERISTICS-BASEDPRIORITIZED SEARCH HEURISTIC(searchpri)

In a final third step, we develop a new search strategy using heuris-tics based on characteristics found in Step 2, which will be anincomplete, but practical alternative to our searchvar strategy.

5.1 Search StrategyOur new search strategy searchpri avoids the overhead of varia-tional execution, but instead again evaluates each candidate higher-order mutant by executing the corresponding test suite, one can-didate mutant at a time just like searchbf and searchgen. Ourkey contribution is ordering how we explore candidate mutantsto steer the search toward more likely candidates. That is, insteadof a naive enumeration of all combinations (searchbf) or an ex-ploration based on random seeds (searchgen), we prioritize basedon the previously identified typical characteristics of higher-ordermutants. Since characteristics for SSHOM and strict-SSHOM donot differ strongly, we develop only a single search strategy.

Conceptually, we calculate a penalty for every candidate higher-ordermutant and prioritize those candidateswith the lowest penalty.We compute the weighted sum of three factors:

penalty = ω1 · order + ω2 · testDiff − ω3 · isN1 (5)

First, we assign penalties based on the number of constituent first-order mutants (order): a candidate with a higher order receivesa larger penalty than a lower-order candidate, thus, prioritizingcandidates with lower order that, as our data shows, are more likelyto be SSHOMs. Second, we penalize candidates constructed fromfirst-order mutants that do not get killed by the same test cases(testDiff, counting the number of test cases that can kill only a subsetof all constituent first order mutants), generalizing our EquivalentTest Failures insight: if all first-order mutants are killed by the exactsame test cases, the candidate is likely to be a SSHOM, and thusgets a 0 penalty, whereas mutants that are killed by different testcases are less likely to form a SSHOM, and thus is deferred witha higher penalty. Finally, we reduce the penalty of a candidate ifthe N+1 Rule applies (isN1, returning 1 or 0); that is, if a candidatecan be constructed by adding one more first-order mutant to aknown SSHOM, the candidate receives a boost and gets prioritized.By default and for our evaluation, we assign the weights ω1 = 5,ω2 = 1, and ω3 = 15, based on our experience with the subjectsystems in Section 4.

Unlike previously used genetic search strategies, where the ex-ploration order nondeterministically depends on random mutationand crossover in every generation, searchpri explores candidatesin a deterministic order (lexical order if two candidates have thesame priority).

5.2 ImplementationSince we cannot enumerate and sort all possible candidate higher-order mutants for large programs, and even the execution of allfirst-order mutants may take a long time, we devise an algorithmfor searchpri that identifies likely candidates in batches, shown inFigure 4. In each batch (configurable, by default one Java packageat a time), we enumerate all candidate higher-order mutants up toa distance and order bound, then sort these candidates by priority,and finally explore these candidates in order until a (time) budgetis reached for that batch. Batching and bounding the search isfeasible since the order and distribution characteristics dominate theprioritization anyway and candidates beyond those bounds wouldbe explored only very late. If needed batches could be revisited laterwith larger bounds to explore more (less likely) candidates.

After batching, our algorithm identifies all first-order mutantsdefined within the given batch (function reachable) and runs the testsuite for each of these first-order mutants to identify which tests

def findSSHOMs(program P, mutants M, testsuite T,maxOrder, maxDist, budget):

foundSSHOMs = ∅# explore the program one fragment at a timefor (batch ← fragments(P)):# identify reachable first-order mutants in fragmentmutants = reachable(M, batch)# run tests on reachable first-order mutantsfomTestResults = for (m ← mutants) evaluate(T, {m})

# enumerate candidate SSHOMs up to order and distance boundscandidates = enumerateCandidates(mutants, maxOrder, maxDist)# compute priorities for each candidatepriorities = computePriorities(candidates, fomTestResults, {})

# explore candidates in decreasing prioritywhile (candidates , ∅ ∧ within budget):candidate = getNext(candidates, priorities)candidates -= candidatehomTestResult = evaluate(T, candidate)if (isSSHOM(fomTestResults, homTestResult)):foundSSHOMs += candidate# update priorities based on N+1 rulepriorities = computePriorities(candidates, fomTestResults,

foundSSHOMs)return foundSSHOMs

Figure 4: Characteristics-based prioritized search algorithm.

fail (function evaluate). Subsequently, the algorithm enumerates allcandidates (function enumerateCandidates) up to a given order bound(by default, mutants composed of up to 6 first-order mutants) andup to a given distance bound (by default, up to 4 methods spreadacross at most 3 classes). Having a manageable set of candidatesin the given batch, the algorithm computes priorities (functioncomputePriorities) for all candidates using Equation 5 and then ex-plores these candidates in order of decreasing priorities (functiongetNext) until either all candidates are explored or a (time) budgethas been reached in that batch (by default, 1 hour per batch). Foreach candidate, it runs the test suite and compares test results to de-termine whether a (strict) SSHOM has been found (function isSSHOM);identified SSHOMs are collected and used to recompute prioritiesbased on additional information for the N+1 rule.

5.3 EvaluationWe evaluate how effective our new search heuristic searchpri is atfinding (strict-)SSHOMs, and additionally evaluate how it general-izes and scales to much larger systems than the ones used in priorstudies on SSHOMs (and used in Sec. 4).Subject Systems. We evaluate searchpri both on the subjectspreviously used in Section 4 and on a fresh set of much larger subjectsystems. The comparison against the 5 previously used subjectsystems allows us to compare effectiveness against the groundtruth derived from variational execution, but the results may sufferfrom overfitting, as we evaluate the search strategy on systemsfrom which the insights that drive its design have been derived.

Hence, we use 3 additional subjects, listed in Table 1 (bottom), af-ter finishing the design of our new search strategy. The new systemsare significantly larger, allowing us to explore the different searchstrategies at a much larger (and possibly more realistic) scale. Toselect the new subject systems, we collected all research papers pub-lished in the last 5 years at ASE, FSE, and ICSE that have the word“mutation” or “mutant” in the title. We then selected the five largestJava systems used, discarding two for which we failed to reliably

SSHOM Strict-SSHOM

Ant

0h 6h 12h 18h 24h

044

507

0h 6h 12h 18h 24h

062

Math

0h 6h 12h 18h 24h

039

0533

0h 6h 12h 18h 24h

028

30

JFreeC

hart

0h 6h 12h 18h 24h

057

6725

0h 6h 12h 18h 24h

051

3

searchgen searchbf searchpri

Figure 5: (Strict-)SSHOMs found over time, averaged over3 executions. Note that time is plotted in linear scale asSSHOMs are found consistently over time due to batching.

execute the tests. We did not run searchvar on these systems, butwe still had to exclude some tests or mutants (reported in Table 1),due to technical issues like hard-to-terminate infinite loops.Measurements. We mirror our previous setup in Section 3.5 andcount the number of (strict-)SSHOMs found over time. We collectmeasurements for searchbf, searchgen, and searchpri. We omitsearchvar for the new systems due to engineering and scalabilityissues discussed in Section 3.4, especially issues with environmentbarrier. Experiments on the small subject systems were performedon the same AWS EC2 instances (Section 3.5). For the new systems,we collected measurements on Linux machines with 1.30GHz Inteli5 CPU and 16GB memory. When using searchpri, we did not needto perform batching for the small subject systems; we used batchingfor the new larger subject systems, one package at a time, with a1 hour budget for each package; all other parameters were left attheir defaults (described above). For the new subject systems, weran each measurement for 24 hours, repeated searchgen 3 times.

All search strategies (except searchvar, not considered here) re-quire executing the test suite repeatedly for each candidate SSHOM.For the larger systems, long test-execution times severely limit thenumber of mutants we can explore. To minimize the slowdownfrom test execution that affects all approaches equally, we imple-ment a standard regression test selection technique [61] that onlyexecutes test cases that can reach the candidate mutant (technically,we instrument the program to record which test reaches the loca-tion of each first-order mutant and only execute tests that reach atleast one first-order mutant of a candidate higher-order mutant).We apply this test optimization for all search strategies.Results. On the small subject systems, as shown in Table 1 andFigure 3, our new search strategy searchpri is often very effective,performing at least as well as and usually significantly outperform-ing both searchbf and searchgen in all subjects. In a few cases,it even outperforms searchvar: In Monopoly it finds almost allhigher-order mutants before variational execution finishes running

the tests and in Chess it finds SSHOMs quickly, not limited by theeffort to solve large satisfiability problems.

For the new and larger systems, our results shown in Table 1 andFigure 5 show that the baseline approaches perform very poorly atthis scale. Without being informed by SSHOM characteristics thesearch in this vast space (e.g., 5 billion candidate combinations ofmutation pairs in Math) these approaches find rarely any SSHOMseven when run for a long time. In contrast, searchpri finds a signif-icant number of (strict-)SSHOMs in each of these systems: Within24 hours it explores most batches (91 % of all packages) and has areasonable precision 4 for finding actual SSHOMs among the testedcandidates (60.9 % in Math, 29.4 % in Ant, and 77.8 % in JFreeChart).

We conclude that searchpri is an effective search strategy thatscales to large systems and generalizes beyond systems from whichthe characteristics have been collected. While we cannot assesshow many SSHOMs we are missing, our strategy is effective atfinding a very large number of them in a short amount of time.

6 THREATS TO VALIDITYExternal validity might be limited by the specific programs, muta-tion operators and test cases.We used subject systems from previouspapers to avoid any own sampling bias. From most subject systems,we had to remove some tests or mutations due to technical problems,either engineering limitations of variational execution or issueswith memory leaks and infinite loops, which might affect the resultsto some degree—though we do not expect to see a systematic bias.

Our study only considers three representative mutation opera-tors among all possible ones [61] and may not generalize to otheroperators. A further analysis of the sensitivity of SSHOMs to a widearray of mutation operators is outside the scope of this paper.

Regarding internal validity, like other studies, our results mightbe affected by possible mistakes in our implementations or measure-ments and especially by we reimplemented the existing searchgenapproach. To mitigate this issue, we verified that the SSHOMs foundby searchgen and searchbf are a strict subset of the ones foundby searchvar. For SSHOMs found only by our approach, we addi-tionally verified a sample manually to ensure they are SSHOMs.

To reduce the impact of nondeterminism in performance mea-surements and genetic search, we report averages across 3 runs, asin previous work [17]. Most differences are large, far exceeding themargins of error from nondeterminism or measurement noise.

As in previous work, SSHOMs should not be affected by equiv-alent mutants, because their specification (Equation 1) explicitlyrequires at least one test to fail for all combined first-order mutants.In contrast to Harman et al. [17], we do not try to establish howmany of our first-order mutants are equivalent mutants, becausewe do not compute any metrics based on the number of first-ordermutants (such as ‘test effectiveness’ in prior studies [17]).

Finally, it would be possible to improve searchbf and searchgenby applying insights from our research, such as a similar batch-ing strategy to explore one Java package at a time and possiblyalso other insights from analyzing SSHOM characteristics. When

4It is difficult to fairly compare the precision of searchpri with searchvar andsearchbf , because searchpri is optimized to find SSHOMs early while searchvarmight get better over time after a few generations. Eventually, the precision of allsearch-based approaches converge to almost 0 for the smaller subjects after prolongedsearching. We provide more data about precision in the appendix.

using batching (results not shown), these approaches indeed per-form better on the large subject systems but are still significantlyoutperformed by searchpri.

7 RELATEDWORKIn this section, we focus our discussions on higher-order mutationtesting, and refer interested readers to a detailed survey for recentadvances in mutation testing in general [61].Approaches for Finding SSHOMs. Early work has investigateddifferent strategies to combine first-order mutants into second-order mutants [35, 50, 53]. Jia and Harman extended this effortto even higher orders using heuristic search looking for certainkinds of valuable higher-order mutants, specifically SSHOMs. Theycompare a greedy, a hill-climbing, and a genetic algorithm andfound that genetic search produces the best results for findingSSHOMs [21, 23]. Since then, higher-order mutation testing hasbeen implemented in different mutation testing tools and frame-works, for different languages [22, 41, 44, 52, 59, 63, 73], usuallyusing some form of heuristic search [21, 23, 43, 60]. Although thiswork specifically targets SSHOMs, our approach can be general-ized to other types of interesting mutants, by updating the way weencode the search as a Boolean satisfiability problem.

Orthogonal to SSHOMs, researchers have recently investigatedanother interesting type of hard-to-kill mutants called dominatormutants [37, 38]. This work searches for the hardest-to-kill mutantsamong a set of first-order mutants, by comparing executions witha given fixed test suite. Despite the heavy cost of computing dom-inator mutants, they have been shown to be an effective researchtool to study existing mutation testing techniques, for example forgauging mutation test completeness [40] and evaluating selectivemutation [39]. More recently, Just et al. [30] show that programcontext can be used to approximate dominator mutants, whichmight also be promising for future search strategies for SSHOMs.Characteristics of SSHOMs. Existing studies on SSHOMs mostlyconcern the quantity of SSHOMs and difficulty of finding them [17,21–23, 43]. For example, Harman et al. [17] discussed how SSHOMsrelate to their constituent first-order mutants, but their discussionfocuses mainly on test effectiveness and efficiency. Jia and Harman[23] discussed characteristics of a single SSHOM in the Triangleprogram (also used in our study), but did not explore SSHOM charac-teristics further. In our work, we can find a complete set of SSHOMs,which provides us more data to study what they look like.Variational Execution. Variational execution was originally de-veloped for information-flow analysis [3] and configuration test-ing [31, 55]. In a new-idea paper, Wong et al. [78] recently suggestedthat variational execution may have additional application scenar-ios, suggesting mutation testing as explored in Step 1 of this paperas one promising direction.

With regard to using variational execution for mutation testing,Devroey et al. [9] are conceptually closest to our work in that theypursue a complete exploration strategy with similarities to lazy con-figuration exploration in SPLat [32, 54]. However, they explore onlytraces in state machines without any joining and thus forgo muchpossible sharing. Their analysis does not distinguish first-order fromhigher-order mutants and does not identify or analyze SSHOM. Sev-eral other researchers have also used advanced dynamic analyses to

speed up the execution of tests in traditional mutation testing (onemutation at a time), looking for possible redundancies and joins[24, 26, 77]. Since our main goal of using variational execution isto explore interactions of first-order mutants rather than speed upmutation analysis, we did not perform a performance comparison.

8 CONCLUSIONSTo efficiently find SSHOMs, we proceed in three steps. First, weuse variational execution to find all SSHOMs in small to medium-sized programs. Second, we analyze basic characteristics of theidentified SSHOMs. Finally, we derive a new prioritized searchstrategy based on the characteristics. The prioritized search scalesto large systems and is effective (albeit not complete) at findingSSHOMs and outperforms the existing state-of-the-art strategy byfar. We hope that the insights and search strategies from this workcan support future work in mutation testing.

REFERENCES[1] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the Accuracy of

Spectrum-Based Fault Localization. In Testing: Academic and Industrial ConferencePractice and Research Techniques-MUTATION (TAICPART-MUTATION 2007). IEEE,89–98.

[2] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining AssociationRules Between Sets of Items in Large Databases. In Proceedings of the 1993 ACMSIGMOD International Conference on Management of Data (SIGMOD ’93). ACM,Washington, D.C., USA, 207–216. https://doi.org/10.1145/170035.170072

[3] Thomas H. Austin and Cormac Flanagan. 2012. Multiple Facets for DynamicInformation Flow. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACTSymposium on Principles of Programming Languages (POPL ’12). ACM, New York,NY, USA, 165–178. https://doi.org/10.1145/2103656.2103677

[4] Thomas H. Austin, Jean Yang, Cormac Flanagan, and Armando Solar-Lezama.2013. Faceted Execution of Policy-Agnostic Programs. In Proceedings of the EighthACM SIGPLAN Workshop on Programming Languages and Analysis for Security(PLAS ’13). ACM, New York, NY, USA, 15–26. https://doi.org/10.1145/2465106.2465121

[5] Armin Biere, Alessandro Cimatti, Edmund Clarke, and Yunshan Zhu. 1999. Sym-bolic model checking without BDDs. In International conference on tools andalgorithms for the construction and analysis of systems. Springer, 193–207.

[6] James Bornholt and Emina Torlak. 2018. Finding code that explodes undersymbolic evaluation. Proceedings of the ACM on Programming Languages 2,OOPSLA (2018), 149.

[7] Bryant. 1986. Graph-Based Algorithms for Boolean Function Manipulation. IEEETrans. Comput. C-35, 8 (Aug. 1986), 677–691. https://doi.org/10.1109/TC.1986.1676819

[8] Marcelo d’Amorim, Steven Lauterburg, and Darko Marinov. 2008. Delta Execu-tion for Efficient State-Space Exploration of Object-Oriented Programs. IEEETransactions on Software Engineering (TSE) 34, 5 (2008), 597–613.

[9] Xavier Devroey, Gilles Perrouin, Mike Papadakis, Axel Legay, Pierre-YvesSchobbens, and Patrick Heymans. 2016. FeaturedModel-BasedMutation Analysis.In Proceedings of the 38th International Conference on Software Engineering (ICSE’16). ACM, New York, NY, USA, 655–666. https://doi.org/10.1145/2884781.2884821

[10] Jackson Antonio do Prado Lima and Silvia Regina Vergilio. 2019. A SystematicMapping Study on Higher Order Mutation Testing. Journal of Systems andSoftware (JSS) 154 (2019), 92–109.

[11] Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019.Empirical review of Java program repair tools: a large-scale experiment on 2,141bugs and 23,551 repair attempts. In Proceedings of the 2019 27th ACM Joint Meetingon European Software Engineering Conference and Symposium on the Foundationsof Software Engineering. 302–313.

[12] Ahmed S. Ghiduk. 2016. Reducing the Number of Higher-Order Mutants withthe Aid of Data Flow. e-Informatica Vol. X (2016), 2016; ISSN 18977979. https://doi.org/10.5277/E-INF160102

[13] Ahmed S. Ghiduk, Moheb R. Girgis, and Marwa H. Shehata. 2017. Higher OrderMutation Testing: A Systematic Literature Review. Computer Science Review 25(Aug. 2017), 29–48. https://doi.org/10.1016/j.cosrev.2017.06.001

[14] R. Gopinath, C. Jensen, and A. Groce. 2017. The Theory of Composite Faults. In2017 IEEE International Conference on Software Testing, Verification and Validation(ICST). 47–57. https://doi.org/10.1109/ICST.2017.12

[15] M. Harman, Y. Jia, and W. B. Langdon. 2010. A Manifesto for Higher OrderMutation Testing. In 2010 Third International Conference on Software Testing,

Verification, and Validation Workshops. 80–89. https://doi.org/10.1109/ICSTW.2010.13

[16] Mark Harman, Yue Jia, and William B. Langdon. 2011. Strong Higher OrderMutation-Based Test Data Generation. In Proceedings of the 19th ACM SIG-SOFT Symposium and the 13th European Conference on Foundations of Soft-ware Engineering (ESEC/FSE ’11). ACM, New York, NY, USA, 212–222. https://doi.org/10.1145/2025113.2025144

[17] Mark Harman, Yue Jia, Pedro Reales Mateo, and Macario Polo. 2014. Angelsand Monsters: An Empirical Investigation of Potential Test Effectiveness andEfficiency Improvement from Strongly Subsuming Higher Order Mutation. InProceedings of the 29th ACM/IEEE International Conference on Automated SoftwareEngineering - ASE ’14. ACM Press, Vasteras, Sweden, 397–408. https://doi.org/10.1145/2642937.2643008

[18] Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Exper-iments on the Effectiveness of Dataflow-and Control-Flow-Based Test AdequacyCriteria. In Proceedings of the International Conference on Software Engineering(ICSE). IEEE, 191–200.

[19] Chihiro Iida and Shingo Takada. 2017. Reducing Mutants with Mutant Kil-lable Precondition. In 2017 IEEE International Conference on Software Testing,Verification and Validation Workshops (ICSTW). IEEE, Tokyo, 128–133. https://doi.org/10.1109/ICSTW.2017.29

[20] Yue Jia. [n.d.]. Higher Order Mutation Testing. Ph.D. Dissertation.[21] Yue Jia and Mark Harman. 2008. Constructing Subtle Faults Using Higher Order

Mutation Testing. In 2008 Eighth IEEE International Working Conference on SourceCode Analysis and Manipulation. IEEE, Beijing, 249–258. https://doi.org/10.1109/SCAM.2008.36

[22] Y. Jia and M. Harman. 2008. MILU: A Customizable, Runtime-Optimized HigherOrder Mutation Testing Tool for the Full C Language. In Testing: AcademicIndustrial Conference - Practice and Research Techniques (Taic Part 2008). 94–98.https://doi.org/10.1109/TAIC-PART.2008.18

[23] Yue Jia and Mark Harman. 2009. Higher Order Mutation Testing. Informationand Software Technology 51, 10 (Oct. 2009), 1379–1393. https://doi.org/10.1016/j.infsof.2009.04.016

[24] Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development ofMutation Testing. IEEE Transactions on Software Engineering (TSE) 37, 5 (Sept.2011), 649–678. https://doi.org/10.1109/TSE.2010.62

[25] James A Jones and Mary Jean Harrold. 2005. Empirical Evaluation of the Taran-tula Automatic Fault-Localization Technique. In Proceedings of the InternationalConference on Automated Software Engineering (ASE). 273–282.

[26] René Just, Michael D. Ernst, and Gordon Fraser. 2014. Efficient Mutation Analysisby Propagating and Partitioning Infected Execution States. In Proceedings of the2014 International Symposium on Software Testing and Analysis - ISSTA 2014. ACMPress, San Jose, CA, USA, 315–326. https://doi.org/10.1145/2610384.2610388

[27] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Databaseof Existing Faults to Enable Controlled Testing Studies for Java Programs. InProceedings of the 2014 International Symposium on Software Testing and Analysis(ISSTA 2014). ACM, San Jose, CA, USA, 437–440. https://doi.org/10.1145/2610384.2628055

[28] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, andGordon Fraser. 2014. Are Mutants a Valid Substitute for Real Faults in SoftwareTesting?. In Proceedings of the 22Nd ACM SIGSOFT International Symposiumon Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA,654–665. https://doi.org/10.1145/2635868.2635929

[29] René Just, Gregory M. Kapfhammer, and Franz Schweiggert. 2011. Using Condi-tional Mutation to Increase the Efficiency of Mutation Analysis. In Proceedings ofthe 6th International Workshop on Automation of Software Test (AST ’11). ACM,New York, NY, USA, 50–56. https://doi.org/10.1145/1982595.1982606

[30] René Just, Bob Kurtz, and Paul Ammann. 2017. Inferring Mutant Utility fromProgram Context. In Proceedings of the 26th ACM SIGSOFT International Sym-posium on Software Testing and Analysis (ISSTA 2017). ACM, Santa Barbara, CA,USA, 284–294. https://doi.org/10.1145/3092703.3092732

[31] Chang Hwan Peter Kim, Sarfraz Khurshid, and Don Batory. 2012. Shared Exe-cution for Efficiently Testing Product Lines. In Proceedings of the InternationalSymposium on Software Reliability Engineering (ISSRE) (Dallas, TX, USA). IEEEComputer Science, 221–230. https://doi.org/10.1109/ISSRE.2012.23

[32] Chang Hwan Peter Kim, Darko Marinov, Sarfraz Khurshid, Don Batory, SabrinaSouto, Paulo Barros, and Marcelo d’Amorim. 2013. SPLat: Lightweight DynamicAnalysis for Reducing Combinatorics in Testing Configurable Systems. In Pro-ceedings of the European Software Engineering Conference/Foundations of SoftwareEngineering (ESEC/FSE). ACM, 257–267.

[33] James C. King. 1976. Symbolic Execution and Program Testing. Commun. ACM19, 7 (July 1976), 385–394. https://doi.org/10.1145/360248.360252

[34] Kim N King and A Jefferson Offutt. 1991. A fortran language system for mutation-based software testing. Software: Practice and Experience 21, 7 (1991), 685–718.

[35] M. Kintis, M. Papadakis, and N. Malevris. 2010. Evaluating Mutation TestingAlternatives: A Collateral Experiment. In 2010 Asia Pacific Software EngineeringConference. 300–309. https://doi.org/10.1109/APSEC.2010.42

https://doi.org/10.1145/170035.170072

https://doi.org/10.1145/2103656.2103677

https://doi.org/10.1145/2465106.2465121

https://doi.org/10.1145/2465106.2465121

https://doi.org/10.1109/TC.1986.1676819

https://doi.org/10.1109/TC.1986.1676819

https://doi.org/10.1145/2884781.2884821

https://doi.org/10.5277/E-INF160102

https://doi.org/10.5277/E-INF160102

https://doi.org/10.1016/j.cosrev.2017.06.001

https://doi.org/10.1109/ICST.2017.12

https://doi.org/10.1109/ICSTW.2010.13


https://doi.org/10.1145/2025113.2025144

https://doi.org/10.1145/2025113.2025144

https://doi.org/10.1145/2642937.2643008

https://doi.org/10.1145/2642937.2643008



https://doi.org/10.1109/SCAM.2008.36

https://doi.org/10.1109/SCAM.2008.36

https://doi.org/10.1109/TAIC-PART.2008.18

https://doi.org/10.1016/j.infsof.2009.04.016


https://doi.org/10.1109/TSE.2010.62

https://doi.org/10.1145/2610384.2610388

https://doi.org/10.1145/2610384.2628055

https://doi.org/10.1145/2610384.2628055

https://doi.org/10.1145/2635868.2635929

https://doi.org/10.1145/1982595.1982606

https://doi.org/10.1145/3092703.3092732

https://doi.org/10.1109/ISSRE.2012.23

https://doi.org/10.1145/360248.360252

https://doi.org/10.1109/APSEC.2010.42

[36] Marinos Kintis, Mike Papadakis, and Nicos Malevris. 2015. Employing Second-Order Mutation for Isolating First-Order Equivalent Mutants. Software Testing,Verification and Reliability 25, 5-7 (Aug. 2015), 508–535. https://doi.org/10.1002/stvr.1529

[37] Bob Kurtz, Paul Ammann, Marcio E. Delamaro, Jeff Offutt, and Lin Deng. 2014.Mutant Subsumption Graphs. In 2014 IEEE Seventh International Conference onSoftware Testing, Verification and Validation Workshops. IEEE, OH, USA, 176–185.https://doi.org/10.1109/ICSTW.2014.20

[38] B. Kurtz, P. Ammann, and J. Offutt. 2015. Static Analysis of Mutant Subsumption.In 2015 IEEE Eighth International Conference on Software Testing, Verificationand Validation Workshops (ICSTW). 1–10. https://doi.org/10.1109/ICSTW.2015.7107454

[39] Bob Kurtz, Paul Ammann, Jeff Offutt, Márcio E. Delamaro, Mariet Kurtz, andNida Gökçe. 2016. Analyzing the Validity of Selective Mutation with DominatorMutants. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium onFoundations of Software Engineering (FSE 2016). ACM, Seattle, WA, USA, 571–582.https://doi.org/10.1145/2950290.2950322

[40] Bob Kurtz, Paul Ammann, Jeff Offutt, and Mariet Kurtz. 2016. Are We ThereYet? How Redundant and Equivalent Mutants Affect Determination of TestCompleteness. In 2016 IEEE Ninth International Conference on Software Testing,Verification and Validation Workshops (ICSTW). IEEE, Chicago, IL, USA, 142–151.https://doi.org/10.1109/ICSTW.2016.41

[41] M. Kusano and Chao Wang. 2013. CCmutator: A Mutation Generator for Concur-rency Constructs in Multithreaded C/C++ Applications. In 2013 28th IEEE/ACMInternational Conference on Automated Software Engineering (ASE). 722–725.https://doi.org/10.1109/ASE.2013.6693142

[42] B. Kushigian, A. Rawat, and R. Just. 2019. Medusa: Mutant Equivalence DetectionUsing Satisfiability Analysis. In 2019 IEEE International Conference on SoftwareTesting, Verification and Validation Workshops (ICSTW). 77–82. https://doi.org/10.1109/ICSTW.2019.00035

[43] William B. Langdon, Mark Harman, and Yue Jia. 2010. Efficient Multi-ObjectiveHigher Order Mutation Testing with Genetic Programming. Journal of Systemsand Software 83, 12 (Dec. 2010), 2416–2430. https://doi.org/10.1016/j.jss.2010.07.027

[44] Duc Le, Mohammad Amin Alipour, Rahul Gopinath, and Alex Groce. 2014.MuCheck: An Extensible Tool for Mutation Testing of Haskell Programs. InProceedings of the 2014 International Symposium on Software Testing and Analy-sis (ISSTA 2014). ACM, New York, NY, USA, 429–432. https://doi.org/10.1145/2610384.2628052

[45] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer.2012. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugsfor $8 Each. In Proceedings of the International Conference on Software Engineering(ICSE). IEEE, 3–13.

[46] Claire Le Goues, Stephanie Forrest, andWestleyWeimer. 2013. Current challengesin automatic software repair. Software quality journal 21, 3 (2013), 421–443.

[47] Chao Liu, Xifeng Yan, Long Fei, Jiawei Han, and Samuel P Midkiff. 2005. SOBER:Statistical Model-Based Bug Localization. Proceedings of the International Sympo-sium Foundations of Software Engineering (FSE) 30, 5 (2005), 286–295.

[48] Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019.Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies.In Proceedings of the 26th IEEE International Conference on Software Analysis,Evolution and Reengineering (SANER ’19). https://arxiv.org/abs/1901.06024

[49] FernandaMadeiral, SimonUrli, MarceloMaia, andMartinMonperrus. 2019. Bears:An Extensible Java Bug Benchmark for Automatic Program Repair Studies. 2019IEEE 26th International Conference on Software Analysis, Evolution and Reengineer-ing (SANER) (Feb. 2019), 468–478. https://doi.org/10.1109/SANER.2019.8667991arXiv:1901.06024

[50] L. Madeyski, W. Orzeszyna, R. Torkar, and M. Józala. 2014. Overcoming theEquivalent Mutant Problem: A Systematic Literature Review and a ComparativeExperiment of Second Order Mutation. IEEE Transactions on Software Engineering40, 1 (Jan. 2014), 23–42. https://doi.org/10.1109/TSE.2013.44

[51] Lech Madeyski and Norbert Radyk. 2010. Judy–a mutation testing tool for Java.IET software 4, 1 (2010), 32–42.

[52] Sonal Mahajan and William G.J. Halfond. 2014. Finding HTML Presentation Fail-ures Using Image Comparison Techniques. In Proceedings of the 29th ACM/IEEEInternational Conference on Automated Software Engineering (ASE ’14). ACM, NewYork, NY, USA, 91–96. https://doi.org/10.1145/2642937.2642966

[53] P. Reales Mateo, M. Polo Usaola, and J. L. Fernández Alemán. 2013. Validat-ing Second-Order Mutation at System Level. IEEE Transactions on SoftwareEngineering 39, 4 (April 2013), 570–587. https://doi.org/10.1109/TSE.2012.39

[54] Jens Meinicke, Chu-Pan Wong, Christian Kästner, Thomas Thüm, and GunterSaake. 2016. On Essential Configuration Complexity: Measuring Interactions inHighly-Configurable Systems. In Proceedings of the 31st IEEE/ACM InternationalConference on Automated Software Engineering (ASE 2016). ACM, New York, NY,USA, 483–494. https://doi.org/10.1145/2970276.2970322

[55] Hung Viet Nguyen, Christian Kästner, and Tien N. Nguyen. 2014. ExploringVariability-Aware Execution for Testing Plugin-Based Web Applications. In Pro-ceedings of the 36th International Conference on Software Engineering (ICSE 2014).

ACM, New York, NY, USA, 907–918. https://doi.org/10.1145/2568225.2568300[56] A. Jefferson Offutt. 1992. Investigations of the Software Testing Coupling Effect.

ACM Transactions on Software Engineering and Methodology 1, 1 (Jan. 1992), 5–20.https://doi.org/10.1145/125489.125473

[57] A. Jefferson Offutt, Ammei Lee, Gregg Rothermel, Roland H. Untch, and ChristianZapf. 1996. An Experimental Determination of Sufficient Mutant Operators. ACMTransactions on Software Engineering and Methodology 5, 2 (April 1996), 99–118.https://doi.org/10.1145/227607.227610

[58] A. J. Offutt, G. Rothermel, and C. Zapf. 1993. An Experimental Evaluationof Selective Mutation. In Proceedings of 1993 15th International Conference onSoftware Engineering. 100–107. https://doi.org/10.1109/ICSE.1993.346062

[59] E. Omar, S. Ghosh, and D. Whitley. 2014. HOMAJ: A Tool for Higher Or-der Mutation Testing in AspectJ and Java. In 2014 IEEE Seventh InternationalConference on Software Testing, Verification and Validation Workshops. 165–170.https://doi.org/10.1109/ICSTW.2014.19

[60] Elmahdi Omar, Sudipto Ghosh, and Darrell Whitley. 2017. Subtle Higher OrderMutants. Inf. Softw. Technol. 81, C (Jan. 2017), 3–18. https://doi.org/10.1016/j.infsof.2016.01.016

[61] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and MarkHarman. 2019. Mutation Testing Advances: An Analysis and Survey. In Advancesin Computers. Vol. 112. Elsevier, 275–378. https://doi.org/10.1016/bs.adcom.2018.03.015

[62] M. Papadakis and N. Malevris. 2010. An Empirical Evaluation of the First andSecond Order Mutation Testing Strategies. In 2010 Third International Conferenceon Software Testing, Verification, and Validation Workshops. 90–99. https://doi.org/10.1109/ICSTW.2010.50

[63] Ali Parsai, Alessandro Murgia, and Serge Demeyer. 2017. LittleDarwin: A Feature-Rich and Extensible Mutation Testing Framework for Large and Complex JavaSystems. In Fundamentals of Software Engineering (Lecture Notes in Computer Sci-ence), Mehdi Dastani and Marjan Sirjani (Eds.). Springer International Publishing,148–163.

[64] Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael DErnst, Deric Pang, and Benjamin Keller. 2017. Evaluating and Improving FaultLocalization. In Proceedings of the International Conference on Software Engineering(ICSE). IEEE, 609–620.

[65] Goran Petrović andMarko Ivanković. 2018. State ofMutation Testing at Google. InProceedings of the 40th International Conference on Software Engineering: SoftwareEngineering in Practice (ICSE-SEIP ’18). ACM, Gothenburg, Sweden, 163–171.https://doi.org/10.1145/3183519.3183521

[66] Goran Petrovic and Marko Ivankovic. 2018. State of Mutation Testing at Google.In Proceedings of the International Conference on Software EngineeringâĂŤSoftwareEngineering in Practice (ICSE SEIP).

[67] G. Petrovic, M. Ivankovic, B. Kurtz, P. Ammann, and R. Just. 2018. An IndustrialApplication of Mutation Testing: Lessons, Challenges, and Research Directions. In2018 IEEE International Conference on Software Testing, Verification and ValidationWorkshops (ICSTW). 47–53. https://doi.org/10.1109/ICSTW.2018.00027

[68] Macario Polo, Mario Piattini, and Ignacio García-Rodríguez. 2009. Decreasing theCost of Mutation Testing with Second-Order Mutants. Softw. Test. Verif. Reliab.19, 2 (June 2009), 111–131. https://doi.org/10.1002/stvr.v19:2

[69] Elnatan Reisner, Charles Song, Kin-Keung Ma, Jeffrey S. Foster, and Adam Porter.2010. Using Symbolic Evaluation to Understand Behavior in Configurable Soft-ware Systems. In Proceedings of the 32Nd ACM/IEEE International Conference onSoftware Engineering - Volume 1 (ICSE ’10). ACM, New York, NY, USA, 445–454.https://doi.org/10.1145/1806799.1806864

[70] Manos Renieres and Steven P Reiss. 2003. Fault Localization with Nearest Neigh-bor Queries. In Proceedings of the International Conference on Automated SoftwareEngineering (ASE). IEEE, 30–39.

[71] Thomas Schmitz, Dustin Rhodes, Thomas H. Austin, Kenneth Knowles, andCormac Flanagan. 2016. Faceted Dynamic Information Flow via Control andData Monads. In Proceedings of the 5th International Conference on Principles ofSecurity and Trust - Volume 9635. Springer-Verlag New York, Inc., New York, NY,USA, 3–23. https://doi.org/10.1007/978-3-662-49635-0_1

[72] Koushik Sen, George Necula, Liang Gong, and Wontae Choi. 2015. MultiSE:Multi-Path Symbolic Execution Using Value Summaries. In Proceedings of the2015 10th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2015.ACM Press, Bergamo, Italy, 842–853. https://doi.org/10.1145/2786805.2786830

[73] S. Tokumoto, H. Yoshida, K. Sakamoto, and S. Honiden. 2016. MuVM: HigherOrder Mutation Analysis Virtual Machine for C. In 2016 IEEE InternationalConference on Software Testing, Verification and Validation (ICST). 320–329.https://doi.org/10.1109/ICST.2016.18

[74] David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-ChuanLiu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-Gonzalez. 2019.BugSwarm:Mining and Continuously Growing a Dataset of Reproducible Failuresand Fixes. In 2019 IEEE/ACM 41st International Conference on Software Engineering(ICSE). IEEE, Montreal, QC, Canada, 339–349. https://doi.org/10.1109/ICSE.2019.00048

https://doi.org/10.1002/stvr.1529

https://doi.org/10.1002/stvr.1529




https://doi.org/10.1145/2950290.2950322


https://doi.org/10.1109/ASE.2013.6693142



https://doi.org/10.1016/j.jss.2010.07.027

https://doi.org/10.1016/j.jss.2010.07.027

https://doi.org/10.1145/2610384.2628052

https://doi.org/10.1145/2610384.2628052

https://arxiv.org/abs/1901.06024

https://doi.org/10.1109/SANER.2019.8667991

https://arxiv.org/abs/1901.06024


https://doi.org/10.1145/2642937.2642966


https://doi.org/10.1145/2970276.2970322

https://doi.org/10.1145/2568225.2568300

https://doi.org/10.1145/125489.125473

https://doi.org/10.1145/227607.227610

https://doi.org/10.1109/ICSE.1993.346062




https://doi.org/10.1016/bs.adcom.2018.03.015

https://doi.org/10.1016/bs.adcom.2018.03.015



https://doi.org/10.1145/3183519.3183521


https://doi.org/10.1002/stvr.v19:2

https://doi.org/10.1145/1806799.1806864

https://doi.org/10.1007/978-3-662-49635-0_1

https://doi.org/10.1145/2786805.2786830

https://doi.org/10.1109/ICST.2016.18



[75] Roland H Untch, A Jefferson Offutt, and Mary Jean Harrold. 1993. Mutationanalysis using mutant schemata. In ACM SIGSOFT Software Engineering Notes,Vol. 18. ACM, 139–148.

[76] Alexander von Rhein, Sven Apel, and Franco Raimondi. 2011. Introducing BinaryDecision Diagrams in the Explicit-State Verification of Java Code.

[77] Bo Wang, Yingfei Xiong, Yangqingwei Shi, Lu Zhang, and Dan Hao. 2017. FasterMutation Analysis via Equivalence Modulo States. In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2017).ACM, New York, NY, USA, 295–306. https://doi.org/10.1145/3092703.3092714

[78] Chu-Pan Wong, Jens Meinicke, and Christian Kästner. 2018. Beyond TestingConfigurable Systems: Applying Variational Execution to Automatic ProgramRepair and Higher Order Mutation Testing. In Proceedings of the 2018 26th ACMJoint Meeting on European Software Engineering Conference and Symposium onthe Foundations of Software Engineering (ESEC/FSE 2018). ACM, New York, NY,USA, 749–753. https://doi.org/10.1145/3236024.3264837

[79] Chu-PanWong, Jens Meinicke, Lukas Lazarek, and Christian Kästner. 2018. FasterVariational Execution with Transparent Bytecode Transformation. Proc. ACMProgram. Lang. 2, OOPSLA (Oct. 2018), 117:1–117:30. https://doi.org/10.1145/3276487

[80] Hao Zhong and Zhendong Su. 2015. An Empirical Study on Real Bug Fixes. InProceedings of the 37th International Conference on Software Engineering - Volume1 (ICSE ’15). IEEE Press, Piscataway, NJ, USA, 913–923.

A DISCUSSION ON PRECISIONIn this section we discuss the precision of our new prioritization-based approach (searchpri) compared to the genetic algorithm(searchgen) and brute-force (searchbf). In general, it is difficultto fairly compare the approaches as they excel in different stagesof the search, and all search-based approaches converge to a lowprecision after running the algorithm for a long time.

Different Stages of the Search. While searchpri is designed tofind SSHOMs early, searchgen initially starts with a random set ofcandidates, which means it is essentially a random search in theearly stage. The genetic search could get better over time whenthe fitness function becomes useful in guiding the search, but it isdifficult to predict due to the stochastic nature of genetic algorithm.Also, the initial precision of searchgen and searchbf can be influ-enced by the order of first-order mutants to combine. For example,ifmut_47 is contained in many SSHOMs then the approaches mightbe more precise if mut_47 is selected early instead of later.

Long Running Search. All three search-based approaches even-tually tend to find significantly fewer solutions (e.g., get stuck ina local optimum). Thus, the precision gets lower, even close to 0%if the number of SSHOMs is small comparing to the search space.Thus, measuring the overall precision of the approaches after, forexample, 12 hours reveals limited insights.

To give an intuition of precision, we compare the studied approacheswith an ideal approach (searchideal) that generates SSHOMs withperfect precision. The execution time of searchideal is based onthe average execution time of the test suite. That is, if the test suitetakes one second to execute then the approach would take 100seconds to find and evaluate 100 SSHOMs.

We show the results in Figure 6. The left-hand side shows theprogress of finding SSHOMs within the 12 hour budget, plottedin linear scale to give an intuitive overview of progress. To illus-trate and compare precision, we show a focused view that focuseson the beginning of the search. In the plots, the steepness of thecurves is the precision at any given point in time. The steepness ofsearchideal illustrates the maximum possible precision that search-based approaches can achieve. In general, we can see that searchprihas a high precision, especially in the very beginning of the search.

Linear view Focused view

Triang

le

0h 3h 6h 9h 12h

096

5

0m 5m

096

5

Cli

0h 3h 6h 9h 12h

037

6

0m 10m 20m 30m

037

6

Mon

opoly

0h 3h 6h 9h 12h

081

8

0m 1h 2h

081

8

Chess

0h 3h 6h 9h 12h

016

403

0m 1h 2h

030

00

Valid

ator

0h 3h 6h 9h 12h

050

000

0m 1h 2h

050

00

searchvar searchgensearchbf searchpri searchideal

Figure 6: Performance of the approaches for findingSSHOMs, plotted on a linear scale. The searchideal line de-picts the performance of an ideal approach, as a referenceto gauge the precision of other approaches over time.

When searching for more difficult SSHOMs, the precision getslower. Especially, in Chess and Validator, we can see that the linesfor searchideal and searchpri are almost parallel, showing thatthe precision is close to ideal. Note that the shift of the lines comesfrom the initial effort of searchpri for evaluating all first-ordermutants and for generating the initial set of candidate SSHOMs.

Regarding searchbf, it appears relatively efficient when search-ing for second-order SSHOMs in Triangle, Cli, andMonopoly.Whensearching for higher orders than two, the precision drops drasticallyfor searchbf. As discussed, the initial precision of searchgen isclose to a random approach, but we can see that the precision mightbecome better over time (see, for example, Triangle and Cli).

https://doi.org/10.1145/3092703.3092714

https://doi.org/10.1145/3236024.3264837

https://doi.org/10.1145/3276487

https://doi.org/10.1145/3276487

Documents

Efficiently Finding Higher-Order Mutantsckaestne/pdf/arxiv20hom.pdfrealistic faults or improving test optimization techniques. Despite interest in studying promising higher-order mutants,