
0098-5589 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2014.2363469, IEEE Transactions on Software Engineering

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 1

STAR: Stack Trace based Automatic Crash Reproduction via Symbolic Execution

Ning Chen, Sunghun Kim, Member, IEEE

Abstract—Software crash reproduction is the necessary first step for debugging. Unfortunately, crash reproduction is often labor intensive. To automate crash reproduction, many techniques have been proposed, including record-replay and post-failure-process approaches. Record-replay approaches can reliably replay recorded crashes, but they incur substantial performance overhead to program executions. Alternatively, post-failure-process approaches analyse crashes only after they have occurred. Therefore, they do not incur performance overhead. However, existing post-failure-process approaches still cannot reproduce many crashes in practice because of scalability issues and the object creation challenge.
This paper proposes an automatic crash reproduction framework using collected crash stack traces. The proposed approach combines an efficient backward symbolic execution and a novel method sequence composition approach to generate unit test cases that can reproduce the original crashes without incurring additional runtime overhead. Our evaluation study shows that our approach successfully exploited 31 (59.6%) of 52 crashes in three open source projects. Among these exploitable crashes, 22 (42.3%) are useful reproductions of the original crashes that reveal the crash triggering bugs. A comparison study also demonstrates that our approach can effectively outperform existing crash reproduction approaches.

Index Terms—Crash reproduction, static analysis, symbolic execution, test case generation, optimization.


1 INTRODUCTION

Software crash reproduction is the necessary first step for crash debugging. Unfortunately, manual crash reproduction is tedious and time consuming [15]. To assist crash reproduction, many techniques have been proposed [12], [20], [33], [38], [43], [52], [63], including record-replay approaches and post-failure-process approaches. Record-replay approaches [12], [43], [52] monitor and reproduce software executions by using software instrumentation techniques or special hardware that stores runtime information. Most record-replay approaches can reliably reproduce software crashes, but incur non-trivial performance overhead [12].

Different from record-replay approaches, post-failure-process approaches [20], [33], [38], [49], [63] analyse crashes only after they have occurred. Since post-failure-process approaches only use the crash data collected by the bug or crash reporting systems [4]–[7], [14], [30], [51], they do not incur performance overhead to program executions. The purpose of post-failure-process approaches varies from explaining program failures [20], [38] to reproducing program failures [33], [63]. Failure explanation approaches try to explain the cause of a crash by analyzing the data-flow or crash condition information. Failure reproduction approaches try to reproduce a crash by synthesizing the original crash execution. However, the efficiency of existing failure reproduction approaches is not satisfactory due to scalability issues such as the path explosion problem [16], [31], where the number of potential paths to analyze grows exponentially with the number of conditional blocks involved. In addition, crashes for object-oriented programs cannot be effectively reproduced by these approaches because of the object creation challenge [60], where the desired object states cannot be achieved because non-public fields are not directly modifiable.

• N. Chen and S. Kim are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. E-mail: {ning, hunkim}@cse.ust.hk.

This paper proposes a Stack-Trace-based Automatic crash Reproduction framework (STAR), which automatically reproduces crashes using crash stack traces. STAR is a post-failure-process approach, as it tries to reproduce a crash only after it has occurred. Compared to existing post-failure-process approaches, STAR has two major advantages: 1) STAR introduces several effective optimizations which can greatly boost the efficiency of the crash precondition computation process; and 2) unlike existing failure reproduction approaches, STAR supports the reproduction of crashes for object-oriented programs using a novel method sequence composition approach. Therefore, the applicability of STAR is greatly broadened.

To reproduce a reported crash, STAR first performs a backward symbolic execution to identify the preconditions for triggering the target crash. Using a novel method sequence composition approach, STAR creates a test case that can generate test inputs satisfying the computed crash triggering preconditions. Since STAR uses the crash stack trace to reproduce a crash, its goal is to create unit test cases that crash at the same location as the original crash and produce crash stack traces that are as close to the original crash stack trace as possible. The generated unit test cases can then be executed to reproduce the original crash and help reveal the underlying bug that caused the crash.

We have implemented STAR for the Java language and evaluated it with 52 real world crashes collected from three open source projects. STAR successfully exploited 31 (59.6%) crashes within a maximum time of 100 seconds each on a commodity platform. Since a successful crash exploitation may not always help reveal the crash triggering bug (Section 5), a detailed investigation was carried out on the usefulness of each crash exploitation. Among all crash exploitations, 22 (42.3%) were identified as useful reproductions of the original crashes, meaning that they could help reveal the actual crash triggering bugs. In comparison, existing approaches such as Randoop [45] and BugRedux [33] could generate useful reproductions of only 8 and 7 crashes, respectively (Section 5.4). We also conducted a developer survey to evaluate how useful the test cases generated by STAR are.

This paper makes the following contributions:
• A novel framework, STAR, that automatically reproduces crashes based on crash stack traces, along with its implementation.
• An empirical evaluation of STAR to investigate its usefulness.

The rest of the paper is organized as follows. Section 2 gives an overview of STAR. Sections 3 and 4 present STAR's approaches in detail. Section 5 presents an empirical evaluation. Section 6 discusses the major challenges and potential future work for STAR and identifies threats to validity. Section 7 surveys related work, and Section 8 concludes.

2 OVERVIEW

In this section, we present an overview of STAR. The details of STAR will be presented in Section 3 and Section 4.

Figure 1 shows the architecture of STAR. It contains four phases: 1) stack trace processing, 2) initial crash condition inferring, 3) crash precondition computation, and 4) test case generation.

Figure 2 is an example to illustrate STAR's four phases. The ComponentImpl class contains three buggy methods. By executing these methods, suppose three different crashes are reported, as shown in Figure 3. For crash (a), when the checkValidity method is called while m_length ≠ m_width, a RuntimeException is explicitly thrown at line 07 of the method, causing a program crash. For crash (b), a NullPointerException is thrown at line 20 when the createColor method is called while colorID ≤ 0 and m_defaultColor == null. For crash (c), a NullPointerException is thrown at line 32 when the add method is called while comp1 ≠ null and comp2 == null.

The following sub-sections overview how STAR generates a crash-reproducing test case in four phases.

Stack Trace Processing In the first phase, STAR processes the bug report to extract the essential crash information, such as: 1) the bug id, 2) the version number, and 3) the crash stack trace. A crash stack trace contains the exception type, and the names and line numbers of the methods being called at the time of the crash.

For bug reports submitted to bug reporting systems such as Bugzilla [51], their bug id and version number information can be extracted automatically from the "id" and "Version" fields of the bug reports. To extract the crash stack trace information, STAR performs regular expression matching on the bug reports' "Description" and "Comments" fields. Crash stack traces generally have very similar text patterns; therefore, they can be extracted from the bug reports using regular expression matching.
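As a rough sketch of this step, the following Java snippet extracts stack frames from free-form bug report text with a regular expression. The pattern and class names are our own simplification based on the standard JVM stack trace format, not STAR's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StackTraceExtractor {
    // Matches JVM-style frames such as "at ComponentImpl.createColor(ComponentImpl.java:20)".
    private static final Pattern FRAME =
        Pattern.compile("at\\s+([\\w.$]+)\\.([\\w$<>]+)\\(([\\w$]+\\.java):(\\d+)\\)");

    // Returns one {class, method, file, line} entry per matched frame,
    // in the order the frames appear in the report text.
    public static List<String[]> extractFrames(String reportText) {
        List<String[]> frames = new ArrayList<>();
        Matcher m = FRAME.matcher(reportText);
        while (m.find()) {
            frames.add(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
        }
        return frames;
    }
}
```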

class ComponentImpl implements IComponent {
01   public void checkValidity() {
02     for (IComponent comp : m_subComponents) {
03       if (!comp.isValid())
04         System.out.println("Warning: ...");
05     }
06     if (m_length != m_width)
07       throw new RuntimeException(); // crash (a)
08   }
09
10   public Color createColor(int colorID) {
11     Color color = new Color(colorID);
12     if (m_doLogging)
13       m_logger.log(color);
14
15     Color createdColor = null;
16     if (colorID > 0)
17       createdColor = color;
18     else
19       createdColor = m_defaultColor;
20     createdColor.initialize(); // crash (b)
21     return createdColor;
22   }
23
24   public void add(IComponent comp1, IComponent comp2) {
25     if (m_doLogging)
26       m_logger.log(comp1);
27     if (comp1 != comp2) {
28       if (m_doLogging)
29         m_logger.log(comp2);
30     }
31     int id1 = comp1.getID();
32     int id2 = comp2.getID(); // crash (c)
33     ...
34   }

     ...
     private int m_width, m_length;
     private Logger m_logger;
     private boolean m_doLogging;
     private Color m_defaultColor;
     private IComponent[] m_subComponents;
}

Fig. 2. Motivating example

RuntimeException:
    at ComponentImpl.checkValidity(ComponentImpl.java:7)
    at ...

(a) Crash stack trace of Figure 2 crash (a)

NullPointerException:
    at ComponentImpl.createColor(ComponentImpl.java:20)
    at ...

(b) Crash stack trace of Figure 2 crash (b)

NullPointerException:
    at ComponentImpl.add(ComponentImpl.java:32)
    at ...

(c) Crash stack trace of Figure 2 crash (c)

Fig. 3. Crash stack traces of Figure 2.

Initial Crash Condition Inferring STAR then infers the initial crash condition that triggers the reported crash at the crash location. For example, the initial crash condition for Figure 2 crash (b) at line 20 is {createdColor == null}, as a NullPointerException is thrown at this location. Initial crash conditions can vary according to the content of the crash lines and the type of exceptions.

Fig. 1. Architecture of STAR, which consists of four main phases: stack trace processing, crash condition inferring, crash precondition computation, and test case generation.

As a proof of concept, STAR currently supports crashes caused by three types of exceptions: 1) Explicitly thrown exception: this type of exception has an initial crash condition of {TRUE}, meaning that the exception can be triggered when the throw instruction is reached; 2) NullPointerException: this type of exception has an initial crash condition in the form of {ref == null}, where ref is the variable being dereferenced at the crash location; 3) ArrayIndexOutOfBoundsException: this type of exception has an initial crash condition in the form of {index < 0 or index >= array.length}, where array is the array variable being accessed at the crash location and index is the accessing index. Sometimes, multiple initial crash conditions may be inferred. For example, if a NullPointerException is raised at a line where two different dereferences (e.g. x.foo + y.bar) exist, two initial crash conditions will be inferred (i.e. {x == null} and {y == null}). Each initial crash condition will then be used to reproduce the NullPointerException individually.
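The three cases above can be summarized in a few lines of code. The sketch below is purely illustrative; the method name and the string representation of conditions are ours, not STAR's internal form.

```java
public class CrashConditionInference {
    // ref: the dereferenced variable (NullPointerException);
    // array/index: the accessed array and index (ArrayIndexOutOfBoundsException).
    public static String inferInitialCondition(String exceptionType,
                                               String ref, String array, String index) {
        switch (exceptionType) {
            case "ExplicitlyThrown":
                // reaching the throw instruction is sufficient
                return "TRUE";
            case "NullPointerException":
                return ref + " == null";
            case "ArrayIndexOutOfBoundsException":
                return index + " < 0 || " + index + " >= " + array + ".length";
            default:
                throw new IllegalArgumentException("unsupported exception type: " + exceptionType);
        }
    }
}
```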

With additional engineering effort, STAR can support more types of exceptions. However, there are a few types of exceptions whose initial crash conditions are difficult to infer. Some typical examples include ClassNotFoundException, InterruptedException and FileNotFoundException. In general, the initial crash conditions are difficult to infer if the triggering of these exceptions depends on the global program state or the external environment.

Crash Precondition Computation Using the crash stack trace and initial crash condition, STAR next tries to figure out how to trigger the target crash at method entries. We adapt a symbolic execution-based approach to compute a set of weakest preconditions [27] for triggering the target crash. For example, for Figure 2 crash (a), a precondition to trigger the crash at the entry of method checkValidity would be: {m_subComponents != null && m_length != m_width}.

Test Case Generation Finally, STAR constructs a test case to reproduce the target crash using a novel method sequence composition approach. This approach generates test inputs that satisfy the crash triggering precondition. Once these test inputs are generated, a unit test case is constructed by invoking the corresponding crashing method in the crash stack trace with these inputs.
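To make the output of this phase concrete, the sketch below shows the general shape of such a crash-reproducing unit test for Figure 2 crash (a). The Component stand-in and its setDimensions mutator are hypothetical (Figure 2 does not show how the private fields are populated); STAR would instead search for a real method sequence that drives the object into a state satisfying the computed precondition.

```java
// Minimal stand-in for Figure 2's ComponentImpl, reduced to what crash (a) needs.
class Component {
    private int m_length, m_width;

    // Hypothetical mutator; in a real target this state would be reached
    // through a composed sequence of existing public methods.
    void setDimensions(int length, int width) {
        m_length = length;
        m_width = width;
    }

    void checkValidity() {
        if (m_length != m_width)
            throw new RuntimeException(); // crash (a)
    }
}

public class CrashATest {
    // Returns true iff the generated inputs reproduce the crash.
    public static boolean reproduce() {
        Component target = new Component();
        target.setDimensions(3, 4); // satisfies m_length != m_width
        try {
            target.checkValidity();
            return false; // crash not reproduced
        } catch (RuntimeException e) {
            return true;  // crashed at the expected location
        }
    }
}
```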

3 CRASH PRECONDITION COMPUTATION

To compute the preconditions that trigger the original crash (in short, crash triggering preconditions), STAR statically analyzes the crashed methods using a backward symbolic execution, which is inter-procedural, and is both path and context sensitive.

STAR's symbolic execution is implemented following a typical worklist approach [44], similar to the ones employed in static analysis frameworks [20], [42]. However, these existing symbolic execution algorithms are not efficient for crash triggering precondition computation because they do not leverage crash stack traces and initial crash conditions to guide their executions. Therefore, STAR adapts and extends an existing symbolic execution algorithm [20] so that it: 1) leverages the input crash stack trace to guide the backward symbolic execution and reduce the search space of the execution, and 2) uses the initial crash condition (e.g. obj == null) as the starting condition for the execution, since our goal is to compute the crash triggering preconditions.

Furthermore, typical symbolic execution algorithms face challenges such as path explosion [31] and inefficient backtracking. To address these challenges, we introduce several optimization techniques that can improve the performance of symbolic execution when computing crash triggering preconditions (Section 3.2).

3.1 Symbolic Execution Overview

Algorithm 1 presents our backward symbolic execution algorithm. The algorithm takes two inputs: 1) the crash stack trace crashStack, and 2) the initial crash condition φinit. Its output is a set of crash triggering preconditions computed for the target crash.

Algorithm 1 initiates the backward symbolic execution procedure (i.e. sym_exec) with φinit as the initial postcondition (Lines 01 - 05). To find out the exact instruction where the symbolic execution should start, a getInstrPrecedingCrash procedure is called. This procedure finds and returns the instruction immediately preceding the instruction that causes the target crash (Line 04). Typically, the cause of an Explicitly thrown exception is the throw instruction, the cause of a NullPointerException is the variable dereferencing instruction, and the cause of an ArrayIndexOutOfBoundsException is the array accessing instruction. For example, assume a NullPointerException has been raised at line x.foo(y), and the input initial crash condition is {x == null}. The getInstrPrecedingCrash procedure will return the instruction immediately preceding the dereference of x.


Algorithm 1 Backward Symbolic Execution

Input: the crash stack trace crashStack
Input: the initial crash condition φinit

Begin
01 Initialize satList to hold satisfiable preconditions.
02 method ← crashStack.frame(1).getMethod()
03 line ← crashStack.frame(1).getLineNum()
04 startInst ← getInstrPrecedingCrash(method, line, φinit)
05 sym_exec(method, startInst, crashStack, 1, φinit, satList)
Output: satList.

Procedure sym_exec(m, startInst, cs, frmIndex, φpost, satList)
06 worklist ← initialize stack and push <startInst, φpost>
07 while worklist ≠ ∅:  // depth-first propagation
08   <instruction, φpost> ← worklist.pop()
09   if instruction is an invocation instruction:
10     φpre ← call sym_exec for invocation target method
11   else:
12     φpre ← transform φpost according to instruction type
13
14   if instruction ≠ m.ENTRY:
15     predecessors ← findPredecessors(instruction)
16     for each predecessor in predecessors:
17       worklist.push(<predecessor, φpre>)
18   else if m is a method in the stack trace:
19     if solver.check(φpre) == SATISFIABLE:
20       model ← solver.getModel(φpre)
21       satList.add(<frmIndex, φpre, model>)
22
23     if frmIndex < cs.totalFrames:
24       m' ← cs.frame(frmIndex + 1).getMethod()
25       line' ← cs.frame(frmIndex + 1).getLineNum()
26       startInst' ← getInstrPrecedingCallSite(m', line')
27       sym_exec(m', startInst', cs, frmIndex+1, φpre, satList)
28   else  // inside some invocation context
29     return φpre  // continue propagation in caller

The instruction returned by getInstrPrecedingCrash is then used as the starting instruction for the sym_exec procedure (Line 05).

During the backward symbolic execution procedure sym_exec, path conditions are collected along the execution paths into symbolic formulae (denoted as φpre or φpost). When Algorithm 1 reaches a non-invocation instruction, it evaluates the instruction according to its instruction type to update the current precondition [20] (Lines 11 - 12). For a method invocation instruction, such as invokevirtual or invokestatic, Algorithm 1 computes its precondition by recursively calling sym_exec to descend the current symbolic execution into the invocation target method (Lines 09 - 10). We use Andersen-style pointer analysis [11] to determine the concrete targets of invokevirtual instructions. When multiple possible targets are found at a call site, the algorithm will descend into each possible target method at this call site. Alternatively, an option is also available for specifying the maximum number of targets to consider at each call site.

When sym_exec reaches the entry of the method under analysis, it evaluates or returns the computed precondition φpre according to the type of the current method, as follows. If the current method is listed in the stack trace frames (i.e., it is a method in crashStack), the satisfiability of φpre is evaluated using an SMT solver, Yices [28]. Satisfiable preconditions are saved as the crash triggering preconditions for the current stack frame level (Lines 18 - 21). After that, Algorithm 1 continues the backward execution from the caller (i.e., the previous stack frame in crashStack) of the current method to compute the crash triggering preconditions for the caller method (Lines 23 - 27). The symbolic execution in the caller method starts from the location where the caller calls the current method. The satisfiable preconditions obtained in the current stack frame level are used as initial conditions for the caller method's symbolic execution.

If the current method is not listed in the stack trace frames (i.e., it is not a method in crashStack), it is a target method of an invocation instruction, which STAR has descended into for computing the precondition of the invocation instruction. In this case, the current precondition φpre is returned to the caller's sym_exec procedure (Lines 28 - 29) as the precondition computed for the invocation instruction.
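The per-instruction transformation on Line 12 can be illustrated with two core rules of weakest-precondition propagation. The toy string-based representation below is our own (real symbolic execution manipulates structured formulae, not text): an assignment x = e substitutes e for x in the current condition, and walking backward through a taken branch conjoins that branch's condition.

```java
public class BackwardTransform {
    // Rule for an assignment "lhs = rhs": wp(lhs = rhs, phi) = phi with rhs
    // substituted for lhs. Naive textual substitution; adequate for this toy example.
    public static String applyAssignment(String phi, String lhs, String rhs) {
        return phi.replace(lhs, rhs);
    }

    // Rule for walking backward through the taken branch of "if (cond)":
    // the branch condition is conjoined onto the current condition.
    public static String applyBranch(String phi, String branchCond) {
        return "(" + branchCond + ") && (" + phi + ")";
    }
}
```

For Figure 2 crash (b), starting from {createdColor == null} at line 20 and walking backward through line 19 (createdColor = m_defaultColor) and the else branch of line 16 yields (colorID <= 0) && (m_defaultColor == null), a precondition at the method entry.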

Algorithm 1 stops when it has finished computing crash triggering preconditions for each method in crashStack.

Illustrative example To illustrate the symbolic execution steps, consider Figure 2 crash (c). Since a NullPointerException is raised at line 32, the initial crash condition input to Algorithm 1 is {comp2 == null}. The sym_exec procedure initially starts from line 31, which is the first statement preceding the dereference of comp2. Along the backward execution, path conditions are added to the current precondition φpre. For example, at line 31, the path condition {comp1 != null} is added to φpre. The implicit path condition {callee != null} is always added automatically at a dereference statement. At line 27, either {comp1 != comp2} or {comp1 == comp2} is added to φpre, depending on the control flow of the current execution. When reaching an invocation instruction such as line 26, the sym_exec procedure is recursively called to compute the precondition for method log. The original execution for method add is resumed when the sym_exec procedure on method log returns. Finally, when the execution reaches the method entry at line 25, since this method (i.e. add) is listed in the crash stack trace, the SMT solver is invoked to evaluate the satisfiability of the current precondition φpre. If φpre is satisfiable, it is saved as one of the crash triggering preconditions for the current stack frame. The execution continues until all paths have been traversed or some termination criteria have been satisfied.

Termination handling Due to the undecidability problem [13], sym_exec could run forever. This problem is mainly caused by the existence of loops and recursive calls in the target code. To handle this problem, we use two global options to limit the maximum number of times loops can be unrolled and the maximum recursive invocation depth. With these restrictions, sym_exec can traverse each program path in finitely many steps. In addition, to ensure the termination of the algorithm within a reasonable time, a maximum time can be assigned to the crash precondition computation process.

Field and array modeling Since STAR's symbolic execution runs backward, field owners and array indices can only be determined during execution after the use of the fields or arrays. Therefore, they cannot be modeled like ordinary local variables. STAR models fields and arrays during the symbolic execution as functions. Fields are modeled as a function f : X → Y, where X = {x | x ∈ the set of object references} and Y = {y | y ∈ the set of object references ∨ y ∈ the set of primitive values}. For example, a field foo.str is modeled as a function str(foo), where foo is the object reference and str is the function for all fields under the name str. Arrays are also modeled as a function array : (X × I) → Y, where (X × I) = {(x, i) | x ∈ the set of array references and i ∈ the set of natural numbers}, and Y = {y | y ∈ the set of object references ∨ y ∈ the set of primitive values}. For example, an array element foo.str[0] is modeled as a function array(str(foo), 0), where str(foo) is the array reference and 0 is the array index. All array elements are modeled under the function array. Both field and array access operations (i.e., read and write) are modeled by functional read and update from the theory of arrays [39], [40].
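The functional read/update behaviour can be mimicked concretely. In the sketch below (our own illustration, not STAR's data structures), a field store returns a new model instead of mutating the old one, mirroring the theory-of-arrays update, while a read mirrors select; the str(foo)-style keys follow the naming scheme described above.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldModel {
    private final Map<String, Object> bindings;

    public FieldModel() { this.bindings = new HashMap<>(); }
    private FieldModel(Map<String, Object> bindings) { this.bindings = bindings; }

    // update: produces a new model in which field "name" of "owner" maps to value,
    // leaving the original model untouched (functional update).
    public FieldModel update(String owner, String name, Object value) {
        Map<String, Object> next = new HashMap<>(bindings);
        next.put(name + "(" + owner + ")", value); // e.g. the key "str(foo)"
        return new FieldModel(next);
    }

    // read: looks up the current binding (functional read, i.e. select).
    public Object read(String owner, String name) {
        return bindings.get(name + "(" + owner + ")");
    }
}
```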

Other language features Besides the basic features, STAR also supports or partially supports other language features such as floating point arithmetic, bitwise operations and string operations. We discuss each of these features individually.

As floating point arithmetic is not natively supported by the SMT solver, Yices, STAR converts floating point values to rational numbers, whose arithmetic is supported. This might introduce some imprecision into the arithmetic, but it has not affected the correctness of the crash reproductions in our evaluation study.

Similar to floating point arithmetic, bitwise operations on integers are not directly supported by Yices. Fortunately, since bitwise operations for bit-vectors are natively supported, we are able to simulate bitwise operations on an integer by creating a bit-vector variable whose value equals the bit-level representation of the integer. We then perform the desired bitwise operations directly on the auxiliary bit-vector variable, and convert the result back to an integer. It is worth noting that bitwise operations in Yices are expensive. Therefore, a formula involving many bitwise operations either takes a long time to evaluate or the evaluation simply runs into a timeout.

Finally, string operations are only partially supported by STAR. In general, off-the-shelf SMT solvers such as Yices and Z3 have no or only very limited support for string operations. Therefore, STAR represents string variables during symbolic execution as arrays of characters (integers). This allows STAR to perform some basic string operations such as equals and length, and even to partially support some more complex operations like indexOf and contains. However, as discussed in Section 6.1, the lack of (complete) support for complex string operations and regular expressions is one of the major challenges for irreproducible crashes. Thus, having a more specialized solver for string constraints may greatly improve the capability of STAR to reproduce string related crashes. The integration of more specialized constraint solvers such as HAMPI [34] or Z3-str [64] into STAR remains as future work.
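As a hedged illustration (not STAR's actual code), representing a string as an array of character codes reduces basic string operations to array constraints: equals becomes equal length plus element-wise equality, and length is just the array's size.

```java
// Toy model of strings as integer (character-code) arrays. Class and
// method names are hypothetical; STAR expresses the same idea as
// solver-level array constraints rather than concrete Java code.
public class StringAsArray {
    // equals: arrays must have the same length and identical elements.
    public static boolean symEquals(int[] a, int[] b) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    // length: simply the size of the character array.
    public static int symLength(int[] a) {
        return a.length;
    }
}
```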

3.2 Optimizations

Typical worklist style symbolic execution algorithms face challenges that can sometimes cause executions to be inefficient. Therefore, in this section, we propose several optimization approaches to improve the efficiency of STAR's symbolic execution.

3.2.1 Static Path Pruning

A major challenge for typical worklist style symbolic execution algorithms is the path explosion problem [16], [31]: the number of potential execution paths grows exponentially with the number of conditional blocks. A loop containing a conditional block has the same effect on path explosion as a sequence of conditional blocks. Figure 2 crash (a) demonstrates one such example. In this figure, the number of potential execution paths for the checkValidity method grows exponentially because of the presence of the loop from Line 02 to Line 05. The path explosion problem is challenging, but it can be alleviated in the case of finding a crashing path, since we are mainly interested in the conditions that could trigger the crash. Therefore, we propose the static path pruning approach to heuristically reduce potential paths by pruning non-contributive conditional blocks during symbolic execution.

The key idea of the static path pruning approach is that conditional blocks may not always contribute to the triggering of the target crash. Our static path pruning approach tries to slow down the growth of potential paths in the worklist by pruning non-contributive conditional blocks. Essentially, this approach can be viewed as an instance of backward static program slicing [8], [59].

However, a typical backward static program slicing approach is not directly applicable to STAR. The main reason is scalability. During the backward symbolic execution of STAR, the set of variables of interest in the slicing criterion is constantly expanding. For example, for an explicitly thrown exception, the initial set of variables of interest is empty because the initial symbolic formula (i.e., φpre in Algorithm 1) is {TRUE} at the throw statement. As the symbolic execution goes on, more conditions are added to φpre, so more variables of interest are added to the slicing criterion. As a result, it is necessary to re-compute a new program slice whenever the set of interesting variables expands, which is not scalable. Therefore, instead of using a more precise but expensive static program slicing approach, STAR introduces a static path pruning approach, which is relatively conservative but more scalable.

In the static path pruning approach, a conditional block is considered contributive to the target crash if it satisfies any of the following criteria. 1) A conditional block is considered contributive if it defines a local variable which is referenced in the current symbolic formula φpre. This criterion is obvious, since a definition by the block of any local variable referenced in φpre would establish a direct data dependency between φpre and the conditional block. 2) A conditional block is considered


contributive if it assigns a value to a field whose field name is referenced in φpre. This criterion is similar to the first one except that it is for object fields. Due to the nature of the backward symbolic execution, the owner objects of fields may not be decidable at the time of use. Therefore, instead of comparing two fields directly, we only compare whether the names of the fields assigned by the conditional block match the names of the fields referenced in φpre. This makes sure that the identification of a non-contributive block is conservative but safe. Note that this criterion may not be safe when porting STAR to languages where pointer arithmetic is allowed. 3) A conditional block is considered contributive if it assigns a value to an array element when there is an array element with a compatible type referenced in φpre. This criterion is similar to the first two except that it is for array elements. However, unlike criterion two, we do not compare the array names because it is possible to have two arrays with different names sharing the same memory location. Therefore, our approach conservatively considers two array elements to be the same if they have compatible types.
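The three criteria can be condensed into a single check along the following lines; all names are hypothetical, and criterion 3's type compatibility is simplified to type-name equality for brevity.

```java
import java.util.Set;

// A sketch of the contributiveness test: the first three parameters are
// the assignable information collected for one conditional block; the
// last three are the locals, field names, and array element types
// currently referenced in the symbolic formula phi_pre.
public class Contributive {
    public static boolean isContributive(
            Set<String> definedLocals,
            Set<String> assignedFieldNames,
            Set<String> assignedArrayTypes,
            Set<String> localsInPhi,
            Set<String> fieldNamesInPhi,
            Set<String> arrayTypesInPhi) {
        for (String v : definedLocals)        // criterion 1: defines a referenced local
            if (localsInPhi.contains(v)) return true;
        for (String f : assignedFieldNames)   // criterion 2: assigns a referenced field name
            if (fieldNamesInPhi.contains(f)) return true;
        for (String t : assignedArrayTypes)   // criterion 3: writes an array element of a referenced type
            if (arrayTypesInPhi.contains(t)) return true;
        return false;
    }
}
```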

This static path pruning approach is more conservative than a typical static program slicing approach. However, it has some desirable advantages. First, it takes only linear time to retrieve all assignable information (i.e., defined local variables, field names, array types) for each conditional block in a method. Second, the assignable information only needs to be computed once for each method and is reusable afterwards. Finally, it is easy to identify whether a conditional block is contributive or not using the previous definition.

Retrieving the assignable information for each conditional block in a method is straightforward. The approach examines each instruction in the method to see if the current instruction defines a local variable, assigns a field or modifies an array location. In such cases, the defined variable, assigned field name or array type is saved as the assignable information of the conditional blocks that contain the current instruction. When an invocation instruction is reached and the assignable information for the invocation target method has not been computed, the approach performs inter-procedural analysis to compute the assignable information for the invoked method and then adds the result back to the current computation. Recursive calls do not need to be analyzed since their instructions are already being examined. The assignable information for each method is computed only once. Afterwards, the information can be reused during the computation of its caller methods.

Figure 4 presents the pseudo-code for the skip_conditional procedure, which uses the assignable information to determine whether the current symbolic execution can skip a conditional block and where it should skip to. Given the current executing instruction, procedure skip_conditional first invokes find_assignable to find the assignable information for each conditional block in the current method, as previously stated. After retrieving the assignable information (i.e., defined local variables, field names and array types), the procedure then finds all conditional blocks merging at instruction. For example, the conditional block from line 03 to 04 merges at line 05 in Figure 2 crash (a). These conditional blocks

Input: the current instruction instruction
Input: the current method method
Input: the current symbolic formula φ
Input: the current assignable_info

if assignable information for method has not been computed:
    find_assignable(method, assignable_info)
conditional_blocks ← get_blocks_merge_at(instruction)
sort conditional_blocks by starting instruction descendingly
skip_to ← instruction  // by default, no skipping at all
for each cond_block ∈ conditional_blocks:
    if !is_contributive(assignable_info.get(cond_block), φ):
        skip_to ← cond_block.start - 1
    else break
Output: skip_to

Fig. 4. Pseudo-code for skipping non-contributive conditional blocks

are then sorted by their starting line numbers in descending order. For the same merging location, a conditional block with a larger starting line number is covered by the ones with smaller starting line numbers. By checking the conditional blocks in descending order, we can quickly find the first contributive conditional block, which should not be skipped. For each sorted conditional block, the is_contributive procedure is called to check if this conditional block is contributive to the current formula φ according to the three contributive criteria. Whenever a contributive conditional block is found, or STAR has checked all conditional blocks merging at instruction, skip_conditional returns the first preceding instruction of the last non-contributive conditional block. The backward symbolic execution will thus skip all non-contributive conditional blocks and continue the execution from the first contributive conditional block found. In Figure 2 crash (a), the conditional block from line 03 to 04 is skipped as it is non-contributive. Moreover, the loop from line 02 to line 05 can be skipped as well.

The static path pruning approach might be unsafe if there are throw instructions in the conditional block but no catch clauses to catch them before the crash location. Therefore, STAR does not skip conditional blocks containing throw instructions. Another place where STAR might be unsafe is implicit exceptions, which can be thrown without a throw instruction. For example, an implicit NullPointerException could be thrown by the statement foo.str = "" without the guarding condition foo ≠ null. STAR currently does not avoid path pruning for implicit exceptions. However, in our experiments, implicit exceptions did not actually cause any incorrectness, because guarding conditions were always added by other statements that had not been skipped.

3.2.2 Guided Backtracking

Another major challenge for typical worklist style symbolic execution algorithms is that their backtracking mechanism can be inefficient. For example, when we symbolically execute the code snippet in Figure 2 crash (b), after executing a path (line 20, line 17 - line 10) backward and finding it unsatisfiable, a typical symbolic execution algorithm will


backtrack to the most recent branching point (i.e., line 12) and execute a new path (line 14, line 12 - line 10) from there. However, this kind of non-guided backtracking is inefficient, as the conditional branch (line 12), which the algorithm backtracks to, is actually irrelevant to the target crash and the new path is definitely unsatisfiable as well.

To address this challenge, we introduce a new backtracking heuristic guided by the unsatisfiable core (unsat core) of the previous unsatisfiable formula input to the SMT solver. The unsat core of an unsatisfiable formula is a subset of the formula's constraints that is still unsatisfiable by itself. Most modern SMT solvers support the extraction of the unsat core from an unsatisfiable formula. For example, for Figure 2 crash (b), after traversing the symbolic execution path (line 20, line 17 - line 10), the constraint {new Color(colorID) == null} is the unsat core. The guided backtracking mechanism leverages the unsat core information to make backtracking more efficient.

The guided backtracking mechanism works as follows: when the formula for a path has been proven unsatisfiable by the SMT solver, STAR immediately extracts the unsat core of this formula. The SMT solver [28] used in STAR always returns a minimal unsat core for the unsatisfiable formula. STAR then backtracks to the closest point where symbols of the constraints in the unsat core could be assigned different values, or where any of the unsat core constraints has not yet been added to the formula. For the case of Figure 2 crash (b), after traversing the symbolic execution path (line 20, line 17 - line 10), STAR immediately backtracks to line 19, since createdColor could be assigned a different value in the else branch.
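The backtracking choice itself can be sketched as a toy model in which constraints and decision points are plain strings (names hypothetical): instead of returning to the most recent decision point, we return to the most recent decision whose branch constraint appears in the unsat core.

```java
import java.util.List;
import java.util.Set;

// Toy unsat-core guided backtracking: decisions holds the branch
// constraints taken along the path, in execution order; unsatCore is the
// solver's minimal unsatisfiable subset of the path constraints.
public class GuidedBacktrack {
    // Returns the index of the most recent decision whose constraint is
    // in the unsat core (the point worth revisiting), or -1 if no
    // decision contributed to the core.
    public static int backtrackTarget(List<String> decisions,
                                      Set<String> unsatCore) {
        for (int i = decisions.size() - 1; i >= 0; i--)
            if (unsatCore.contains(decisions.get(i)))
                return i;
        return -1;
    }
}
```

Non-guided backtracking would always revisit index decisions.size() - 1; the guided variant skips decisions that cannot affect the core's unsatisfiability.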

3.2.3 Intermediate Contradiction Recognition

Finally, during backward symbolic execution, inner contradictions can develop in the current precondition formula φpre. For example, if a definition statement v1 = null is executed when a condition {v1 ≠ null} has been previously added to φpre, a contradiction develops in φpre, as the condition {v1 ≠ null} cannot be satisfied anymore. Similarly, when a condition {v1 ≠ v2} is added to φpre, it may also cause an inner contradiction if a condition {v1 == v2} already exists in φpre. Therefore, recognizing inner contradictions and dropping contradicted paths as early as possible is important for improving the efficiency of symbolic execution.

However, even though the performance of SMT solvers has improved significantly in recent years, it is still too expensive to perform satisfiability checks on the current precondition formula at every execution step. To stop contradicted executions early with a reasonable number of SMT checks, STAR performs SMT checks only at the merging points of conditional blocks. During backward symbolic executions, multiple paths can be pushed into the worklist only at the merging points. Therefore, detecting contradictions at merging points can prevent contradicting paths from being pushed into the worklist. In this way, already contradicted executions do not cause an exponential growth of potential paths in the worklist.

For Figure 2 crash (c), after traversing the path (line 32 - line 30, line 27), three conditions are added to the current precondition formula: 1) {comp2 == null}, which is the initial

crash condition inferred at line 32, 2) {comp1 ≠ null}, which is the condition added at line 31, and 3) {comp1 == comp2}, which is the condition for taking the branch from line 30 to line 27. Since these three conditions can never hold at the same time, an inner contradiction has developed in the current precondition formula. The intermediate contradiction recognition process quickly recognizes this contradiction and immediately drops the corresponding path.
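Contradictions of this equality/disequality shape can be caught cheaply; the toy union-find check below (not the SMT-based check STAR actually performs) illustrates the idea: merge all equated terms into classes, then see whether any disequality relates two terms of the same class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy contradiction detector over constraints of the form x == y and
// x != y (with "null" treated as just another term). Names hypothetical.
public class MergeCheck {
    public static boolean contradictory(List<String[]> eqs, List<String[]> neqs) {
        Map<String, String> parent = new HashMap<>();
        for (String[] e : eqs)
            union(parent, e[0], e[1]);          // merge equated terms
        for (String[] d : neqs)                  // x != y contradicts x ~ y
            if (find(parent, d[0]).equals(find(parent, d[1]))) return true;
        return false;
    }

    static String find(Map<String, String> p, String x) {
        p.putIfAbsent(x, x);
        while (!p.get(x).equals(x)) x = p.get(x); // walk to the class root
        return x;
    }

    static void union(Map<String, String> p, String a, String b) {
        p.put(find(p, a), find(p, b));
    }
}
```

On the example above, {comp2 == null} and {comp1 == comp2} place comp1, comp2 and null in one class, so {comp1 ≠ null} is immediately contradictory.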

To verify the effectiveness of the proposed optimization approaches in STAR, we present a detailed comparison study of STAR's performance with and without optimizations in Section 5.2.4.

4 TEST CASE GENERATION

After computing the crash triggering preconditions, we need to generate test inputs that satisfy these preconditions. If the necessary test inputs are primitive types, it is trivial to generate them. However, if the necessary test inputs are objects, generating objects satisfying the crash triggering preconditions can be a non-trivial task [21], [54], [60]. The major challenge is that non-public fields in objects (classes) are not directly modifiable. By using Reflection, it is possible to modify non-public fields, but this easily breaks class invariants [17] and may produce invalid objects. Such invalid objects may be able to trigger the desired crashes, but they are not useful for revealing the actual bugs. Even if objects are successfully created by Reflection using class invariant information obtained from humans or through sophisticated analysis, they may be less helpful for fixing the crashes, since developers still do not know how these crash triggering objects are created in a non-Reflection manner. Therefore, test inputs should be created and mutated through a legitimate sequence of method calls.

In this section, we present a novel demand-driven test case generation approach, TESTGEN. TESTGEN can compose the necessary method sequences to create and mutate test inputs satisfying the target crash triggering preconditions.

4.1 Architectural Design

Figure 5 presents the architectural design of TESTGEN. For a given subject program, TESTGEN first computes the intra-procedural method summary information for the subject program (Section 4.3). Then, this intra-procedural summary information is used to compute the precise summary of each program method (Section 4.4). Finally, using the computed method semantic information, TESTGEN composes legitimate method sequences for generating test inputs that satisfy the desired crash triggering preconditions (Section 4.5.1).

4.2 Summary Overview

The summary of a method is the semantics of the method represented in propositional logic. Since a method can have multiple paths, its summary can be defined as the disjunction of the summaries of its individual paths. The summary of each individual path can be further defined as φpath = φpre ∧ φpost, where φpre is the path condition represented as a conjunction of constraints over the method inputs (i.e., any


Fig. 5. The architecture of TestGen. TestGen consists of three main phases: intra-summary computation, inter-summary computation, and method sequence composition using the computed method semantic information. The SMT solver module is used in all three phases.

memory address read by the method). φpost is the postcondition of the path, represented as a conjunction of constraints over the method outputs (i.e., any memory address written by the method). We refer to the φpre and φpost of a method path as the path condition and effect of this path from now on.

The method summary provides us with the precise semantic meaning of the method. This semantic information is necessary for composing the method sequences for generating target objects. Therefore, the method summary needs to be computed first.

Computing the summary of a method can be expensive when the method involves numerous calls to other methods, which in turn can make further calls. This is because the number of potential paths that need to be explored grows exponentially with the number of method calls. TESTGEN adapts the idea of compositional computation introduced in dynamic test generation [10], [31] to help reduce the number of paths that need to be explored. However, instead of performing a top-down style computation like the original compositional approaches, TESTGEN performs a bottom-up style computation. TESTGEN computes method summaries compositionally in two steps. First, it computes the intra-procedural method summaries (in short, intra-summaries) through a forward symbolic execution. Then, it computes the inter-procedural method summaries (in short, inter-summaries) in a bottom-up manner by merging the intra-summaries together.

4.3 Intra-procedural Summaries

4.3.1 Overview

The first phase of TESTGEN is intra-summary computation. In this phase, TESTGEN applies an intra-procedural path sensitive forward symbolic execution algorithm to all the methods whose summaries we want to compute. During the symbolic execution of a method path, TESTGEN collects and propagates the symbolic value of each variable along the path into a symbolic memory. To compute the summary of an executing path, φpath, TESTGEN adds a constraint (expressed over the symbolic values of method inputs) to the path condition φpre whenever it meets a conditional branch during the

execution. At the exit block of the path, TESTGEN constructs the effect φpost of this path by collecting the symbolic values of each method output from the current symbolic memory. The summary of a method is computed by taking the disjunction of the summaries of its individual paths.

Since the symbolic execution algorithm applied in this phase is intra-procedural, effects introduced by callee methods along a path are not considered. TESTGEN uses Skolem constants [20] to represent the possible effects induced by these skipped method calls. Section 4.3.4 discusses this issue in detail.

4.3.2 Forward Symbolic Execution

Figure 6 presents the simplified pseudo-code for the intra-procedural summary computation. In general, it is a worklist-based algorithm [44] which traverses every method path in preorder. For a target method, method, this algorithm computes the summaries of its execution paths. The initial element in the worklist stack is the method entry instruction along with a TRUE value for φpre, and an initial symbolic memory S where each method input is assigned an initial symbolic value. Method instructions are traversed in preorder until every method path has been executed or some predefined thresholds have been satisfied (e.g., a maximum time limit). At every method instruction, the symbolic values in the current symbolic memory S are propagated according to the type of instruction and the propagation rules listed in Table 1. Constraints may be added to the current path condition φpre

according to the instruction type, as specified in Table 1. For conditional instructions, the symbolic expressions of their branch conditions are also added to φpre according to the execution control flow.

When the symbolic execution reaches an exit node of the current method, the execution for this path finishes. The algorithm then leverages an SMT solver [28] to evaluate the satisfiability of φpre. If φpre is satisfiable, the effect of this path (i.e., φpost) is further constructed by collecting the symbolic values of all method outputs from the current symbolic memory S. Note that the outputs of a method are not just


TABLE 1
Forward propagation rules for some of the instruction types

INSTRUCTION TYPE        PROPAGATION RULE                              CONSTRAINTS ADDED TO φpre
v1 = v2                 put_val(v1, get_val(v2))                      ∅
v1 = v2 op v3           put_val(v1, get_val(v2) op get_val(v3))      ∅
v1 = v2.f               put_val(v1, read(f, v2))                     v2 ≠ null
v1 = v2[i]              put_val(v1, read(array, (v2, i)))            v2 ≠ null, i ≥ 0, i < v2.length
v1.f = v2               update(f, v1, get_val(v2))                   v1 ≠ null
v1[i] = v2              update(array, (v1, i), get_val(v2))          v1 ≠ null, i ≥ 0, i < v1.length
v1 = new T              put_val(v1, instanceof(T))                   ∅
v1 = v2.func(v3, ...)   put_val(v1, σret(func))                      v2 ≠ null;
                                                                      for any member or static field reachable by func
                                                                      in the current symbolic memory: σupdate(field, func);
                                                                      for any member or static array reachable by func
                                                                      in the current symbolic memory: σupdate(array, func)

Input: the method to compute
Global: a worklist to hold the next instructions

Initialize a summaries set.
Initialize a symbolic memory S.
worklist.push(< entry_inst, TRUE, S >)
while worklist ≠ ∅:  // preorder traversal
    < instruction, φpre, S > ← worklist.pop()
    S ← propagate S according to instruction
    φpre ← add constraints to φpre according to instruction
    successors ← findSuccessors(instruction)
    if successors ≠ ∅:
        for each successor in successors:
            if instruction is a conditional branch instruction:
                φpre' ← add branch condition to φpre
                worklist.push(< successor, φpre', S >)
            else:
                worklist.push(< successor, φpre, S >)
    else if solver.check(φpre) == SATISFIABLE:
        φpost ← ∧_{v ∈ outputs} S(v)
        summaries.add(φpre ∧ φpost)
Output: summaries

Fig. 6. Pseudo-code for intra-procedural summary computation

its return value, but also any other memory addresses that can be written by this method, such as parameter object fields, static fields, and heap locations. Finally, the summary for this path (i.e., φpre ∧ φpost) is added to summaries.

If a branch condition contains expressions that do not belong to the set of theories supported by the SMT solver (e.g., non-linear arithmetic), the execution will skip paths involving this branch.

In the presence of loops, the number of possible execution paths is infinite, which would cause the algorithm to run forever. To address this issue, the algorithm automatically unrolls loops for a certain number of iterations specified by users.

4.3.3 Propagation Rules

Table 1 lists some propagation rules for the forward symbolic execution. In this table, function get_val reads the current symbolic value of a variable from the current symbolic memory. Function put_val writes a symbolic value to a variable in the current symbolic memory. Functions read and update, as discussed in Section 3.1, correspond to the read and update functions in the theory of arrays.

4.3.4 Representing Uninterpreted Invocations

Since TESTGEN does not interpret method invocations during the intra-procedural summary computation, it uses Skolem constants [20] to represent the possible effects induced by the uninterpreted method invocations, as shown in Table 1. When the forward symbolic execution algorithm meets a method invocation, two types of Skolem constants may be generated, namely σret and σupdate. A σret constant represents the possible symbolic value returned by this invocation. A σupdate constant represents the possible uninterpreted functional updates induced by this invocation to a function representing a field or an array. However, the set of fields or arrays updated by this invocation is unknown during the current intra-procedural analysis. Therefore, all member and static fields or arrays that are potentially reachable by the invocation and referenced in the current symbolic memory will be updated.

With the help of the Skolem constants, the forward symbolic execution algorithm can propagate the current symbolic memory without actually interpreting any method invocations. All Skolem constants will be interpreted to compute the inter-procedural summaries in the next phase.
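A toy sketch of this Skolemization step (all names hypothetical): an uninterpreted call yields a fresh σret constant for its return value, and an uninterpreted σupdate marker for each reachable field in the current symbolic memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy generation of Skolem constants at an uninterpreted call site,
// rendered as strings for illustration only.
public class Skolemize {
    private static int fresh = 0;

    // sigma_ret: a fresh constant standing for the unknown return value.
    public static String sigmaRet(String callee) {
        return "sigma_ret_" + callee + "_" + (fresh++);
    }

    // sigma_update: one uninterpreted update marker per field that the
    // callee might reach in the current symbolic memory.
    public static Map<String, String> sigmaUpdate(String callee,
                                                  Set<String> reachableFields) {
        Map<String, String> updates = new HashMap<>();
        for (String f : reachableFields)
            updates.put(f, "sigma_update_" + callee + "_" + f);
        return updates;
    }
}
```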

4.4 Inter-procedural Summaries

Intra-summaries do not contain the effects of method invocations, so they are not precise enough to be directly used to compose method sequences for generating objects. In this phase, TESTGEN computes the inter-procedural summaries (in short, summaries) for all program methods from their corresponding, previously computed intra-summaries.

4.4.1 Applying Callee Summaries

TESTGEN differs from compositional test generation [10], [31] in that it computes method summaries in a bottom-up manner, leveraging a method dependency graph computed for the subject program. This allows TESTGEN to compute the summary of a method before any of its dependent (i.e., caller) methods. The reason for using a bottom-up computation is that we want to retrieve the path summaries of a method as completely as possible with few redundant computations.

In general, the path summary of a (caller) method can be computed by applying the summaries of its callee methods


one by one for all call sites along this path. Since one callee method can have multiple path summaries, each of these summaries will be applied to the current caller path summary separately. Therefore, the application of a callee method can generate a set of summaries corresponding to taking different paths within the callee method.

Applying a callee path (φcallee_path = φcallee_pre ∧ φcallee_post) to a caller path (φpath = φpre ∧ φpost) works as follows. First, the path condition of the applied summary (denoted as φapplied_pre) is constructed by taking the conjunction of the path conditions of the two (caller and callee) paths. So, φapplied_pre = φpre ∧ φcallee_pre. Formal variable names in the callee path are converted to actual names in the caller path before the application. Second, all Skolem constants in φpath, which were introduced at the current call site during the intra-procedural computation, are substituted with actual values or effects from the callee. More specifically, all instances of σret

in φpath are substituted with the actual symbolic return value of φcallee_path. All instances of σupdate for a field f are substituted with the actual functional updates to field f found in φcallee_path. This means that we are applying the effect of the callee path on field f to the caller path. Finally, the SMT solver checks the satisfiability of the path condition of the applied summary (i.e., φapplied_pre). Summaries with unsatisfiable path conditions are discarded immediately, before applying the effects of other callee methods from the rest of the call sites.

To avoid infinite computation in the presence of recursive or indirectly recursive calls, TESTGEN sets a maximum depth for applying recursive method summaries. Thus, recursive method summaries are applied to the current φpath only up to a limited depth.

4.4.2 Other Details

Due to the presence of virtual calls, one call site may have multiple possible targets. TESTGEN determines the concrete targets for all virtual calls in the subject program with a pre-computed call graph built using an Andersen-style analysis [11]. If multiple targets are possible at a call site during the process of applying callee summaries, TESTGEN applies each possible target separately to the current caller path.

For methods with an infinite or very large number of paths, TESTGEN may not retrieve the complete set of summaries corresponding to every possible method path. Therefore, STAR may not reproduce some crashes because of the missing path summary information. However, the potential incompleteness of the path summary information does not affect the correctness of any generated crash reproducible method sequences.

Figure 7 presents an example of a path summary for the method ComponentImpl.createColor(int) from Figure 2. This summary corresponds to the method path: (Lines 11 - 12, 14 - 17, 20 - 22). Variables v1, v2 and RET in the example represent the method callee, the parameter colorID and the return object respectively. The effect of this method path is returning a Color object with its m_colorID field and m_initialized field set to v2 and true respectively.

4.5 Method Sequence Composition

After computing the summary for each program method, TESTGEN can now use these method summaries

Path Conditions:
v1 != null
v1.m_doLogging == false
v2 > 0

Effect:
RET instanceof Color
RET.m_colorID = v2
RET.m_initialized = true

Fig. 7. An example of a path summary

to compose method sequences to satisfy the crash triggering preconditions. A crash triggering precondition can consist of conditions involving several test inputs. All test inputs should satisfy their corresponding desired object states for the whole crash triggering precondition to be satisfied. Therefore, to create a crash reproducible test case, TESTGEN needs to compose method sequences that can generate all test inputs satisfying their corresponding object states.

4.5.1 Example

We first use a simple example in Figure 8, adapted from the Apache Commons Collections library, to illustrate the general method sequence composition process1. For the purpose of illustration only, we omit some in-depth details such as method selection, path selection, and validating formula construction. These details will be presented in Sections 4.5.3 to 4.5.5. In Figure 8, the solid rectangles represent desired object states which we want to satisfy. Each is numbered at its upper right corner. The ovals with dashed borders are candidate methods which have been selected by TESTGEN to compose the method sequence.

Initially, a desired object state S1-1 is input. This is the final object state we want TESTGEN to satisfy. Through a method selection process, TESTGEN finds a method MKMap.MKMap(LMap) which can generate objects satisfying this object state. However, to achieve this object state by method MKMap.MKMap(LMap), its input parameter (i.e., v2 in the figure) should satisfy another object state S2-1. More precisely, object state S2-1 is the precondition for method MKMap.MKMap(LMap) to generate objects satisfying object state S1-1. Thus, the satisfaction problem for object state S1-1 has been deduced into the satisfaction problem for object state S2-1.

Following a similar method selection process, TESTGEN finds another method LMap.addMapping(Object) which can modify a LMap object to object state S2-1 if precondition object states S2-2 and S3-1 have both been satisfied. Then again, the satisfaction problem for object state S2-1 has been deduced into the satisfaction problems for object state S2-2 and object state S3-1 for parameters v2 and v3 respectively.

For object state S2-2, TESTGEN finds that by calling method LMap.LMap(int, int) with integer parameters 1 and 2, a LMap object satisfying this object state can be generated. Similarly, by calling method Object.Object(), parameter object v3, which satisfies object state S3-1, can be generated. After that, both object states S2-2 and S3-1 are deduced to TRUE, meaning

1. Class names and method names have been modified for better readability. MKMap is short for MultiKeyMap and LMap is short for LRUMap.


Composed Sequence:
int v5 = 2;
int v4 = 1;
Object v3 = new Object();
LMap v2 = new LMap(v4, v5);
v2.addMapping(v3);
MKMap v1 = new MKMap(v2);

Desired object states (solid rectangles in the figure):
S1-1: v1 instanceof MKMap; v1.map instanceof LMap; v1.map.size = 1; v1.map.data.length = 1
S2-1: v2 instanceof LMap; v2.size = 1; v2.data.length = 1
S2-2: v2 instanceof LMap; v2.size = 0; v2.threshold = 2; v2.data.length = 1
S3-1: v3 instanceof Object
S3-2: v4 = 1; v5 = 2

Selected candidate methods (dashed ovals in the figure): MKMap(v2), v2.addMapping(v3), LMap(v4, v5), Object()

Fig. 8. A method sequence composition example adapted from Apache Commons Collections

that we do not need to satisfy additional precondition object states anymore.

Finally, by concatenating all selected candidate methods backward, TESTGEN composes a method sequence as shown in the lower left corner of the figure. By executing this method sequence, an object satisfying the input object state S1-1 can be generated.

4.5.2 Method Sequence Deduction

Figure 9 presents the pseudo-code for the sequence composition approach. This approach composes a method sequence to satisfy a desired object state in a deductive process. Given a desired object state, φstate, the approach first selects a set of candidate methods which may produce an object satisfying this state. Details for the method selection process are presented in Section 4.5.3.

For each candidate method m, its path summaries are loaded. Then, TESTGEN performs a path summary selection as discussed in Section 4.5.4 to reduce the number of paths that need to be examined. For each selected path, TESTGEN checks if this path can produce the desired object state by leveraging an SMT solver [28]. More precisely, for a path with path summary φpath, TESTGEN examines whether φpath ∧ φstate is satisfiable. Details about the construction of the validating formula from φpath and φstate are presented in Section 4.5.5.

If the SMT solver returns satisfiable for a method path, the approach further queries from the solver an input assignment (i.e., a model) which satisfies φpath ∧ φstate. The current candidate method m can produce the desired object state when its inputs satisfy this assignment. Then, this assignment is separated into several states (deduced_states) corresponding to different inputs (e.g., callee, parameters, static fields, etc.). In this way, the satisfaction problem for φstate is deduced into the satisfaction problems for the deduced_states.

If a deduced_state corresponds to a primitive type, it can be easily satisfied by directly assigning the desired value. If a deduced_state corresponds to an object type, the sequence composition approach is recursively invoked to compose the method sequence for generating this deduced object state.

Input: desired object state φstate to satisfy
Input: the target class clazz

Find candidate_methods from clazz.
for each method m in candidate_methods:
    full_sequence ← EMPTY
    summaries ← load m's path summaries
    summaries ← select candidate path summaries
    for each summary in summaries:
        formula ← construct validating formula for summary
        if solver.check(formula) == SATISFIABLE:
            assignment ← solver.getModel(formula)
            deduced_states ← separate assignment by inputs
            deduced_sequences ← generate for deduced_states
            if deduced_sequences have all been generated:
                full_sequence.add(deduced_sequences)
                full_sequence.add(new InvokeStatement(m))
                break
    if full_sequence != EMPTY:
        break
Output: full_sequence

Fig. 9. Pseudo-code for method sequence composition

If all deduced_states have been satisfied, a full_sequence is composed by adding the deduced_states' generation sequences (deduced_sequences) and an invocation statement to the current method m with the generated inputs. If the generation sequence for any of the deduced_states cannot be composed, the sequence composition approach goes on to examine the next path summary of m. During the composition process, TESTGEN caches all composed sequences and their corresponding deduced_states. For a new deduced_state that is subsumed by a previously satisfied deduced_state, TESTGEN will reuse the previously generated input or the cached sequence.

4.5.3 Candidate Method Selection

In the target class clazz, there are many methods with various effects. To reduce the number of methods that need to be examined during the sequence composition process, TESTGEN only selects a subset of these methods as the candidate set to be examined.

The criterion for selecting a candidate method is that the selected method should be a non-private method or constructor that can modify any of the fields referenced in φstate. For example, for a desired object state v1.size == 1, only methods or constructors which can modify the size field of the class of v1 are selected as candidate methods. For a desired object state which does not reference any fields, such as v1 instanceof Foo, the non-private constructors or factory methods of class Foo are selected as the candidate methods. In future versions of TESTGEN, we will also try to select methods from other classes to generate the desired object states.

The selected candidate methods are also prioritized based on their number of path summaries, the number of non-primitive type parameters and how directly they modify the target fields (i.e., how many subsequent calls they need to go through


to modify the target fields). Methods with fewer path summaries, fewer non-primitive type parameters and which can modify the target fields more directly are assigned higher priorities to be examined.
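A minimal sketch of this prioritization, assuming a lexicographic ordering of the three criteria (the paper does not specify how the criteria are combined; CandidateMethod and its fields are illustrative names, not STAR's types):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative candidate descriptor; field names are assumptions.
class CandidateMethod {
    final String name;
    final int callDepthToField;   // 0 = modifies the target field directly
    final int pathSummaries;      // fewer summaries -> cheaper to examine
    final int nonPrimitiveParams; // fewer object parameters -> easier inputs
    CandidateMethod(String name, int depth, int summaries, int params) {
        this.name = name;
        this.callDepthToField = depth;
        this.pathSummaries = summaries;
        this.nonPrimitiveParams = params;
    }
}

class CandidateRanking {
    // Examined first: methods that modify the target fields most directly,
    // then those with fewer path summaries, then fewer object parameters.
    static List<CandidateMethod> prioritize(List<CandidateMethod> candidates) {
        List<CandidateMethod> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                .comparingInt((CandidateMethod m) -> m.callDepthToField)
                .thenComparingInt(m -> m.pathSummaries)
                .thenComparingInt(m -> m.nonPrimitiveParams));
        return sorted;
    }
}
```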

4.5.4 Path Summary Selection

Not every loaded path summary is useful for the sequence composition. For a target object state, we want to examine as few path summaries as possible before finding a satisfiable one. Therefore, similar to the candidate method selection, TESTGEN also performs a path summary selection process to reduce the number of path summaries that need to be examined.

For the path summary selection, TESTGEN introduces a lattice-based summary selection approach. To construct the lattice, we define a partial order ≤ over the set of path summaries as follows. For two path summaries, S1 and S2, we have S1 ≤ S2 if the satisfaction of the deduced_state of S2 subsumes the satisfaction of the deduced_state of S1. In other words, if the deduced_state of S2 can be satisfied, the deduced_state of S1 can always be satisfied as well.

Using this partial order definition, TESTGEN constructs a lattice L = (S, ≤) where S is the set of path summaries, and ≤ is the summary partial order previously defined. During the method sequence composition process, TESTGEN only selects path summaries that are not greater than any element whose deduced_state cannot be satisfied. This is because, if the deduced_state of an element S1 is not satisfiable, the deduced_states of all elements greater than S1 in L are not satisfiable either, by definition.

For example, consider two path summaries S1 and S2 with deduced_states S1: {foo.bar != null} and S2: {foo.bar != null && foo.size == 1}. S1 ≤ S2 holds in L according to the partial order definition. If the deduced_state of S1 cannot be satisfied, the deduced_state of S2 cannot be satisfied either. Thus, TESTGEN discards S2 and any other path summaries greater than S1 during the method sequence composition process.
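The pruning rule can be sketched by modeling a deduced_state as a set of atomic conditions, so that S1 ≤ S2 whenever S2's condition set contains S1's. This set encoding is an illustrative assumption, not STAR's internal representation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class SummaryLattice {
    // Keep only states that are not greater than (here: not a superset of)
    // any state already shown unsatisfiable; satisfying such a state would
    // imply satisfying an unsatisfiable one.
    static List<Set<String>> prune(List<Set<String>> candidates,
                                   List<Set<String>> unsatisfiable) {
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> state : candidates) {
            boolean dominated = false;
            for (Set<String> unsat : unsatisfiable) {
                if (state.containsAll(unsat)) { dominated = true; break; }
            }
            if (!dominated) kept.add(state);
        }
        return kept;
    }
}
```

With S1 = {foo.bar != null} known unsatisfiable, S2 = {foo.bar != null, foo.size == 1} is discarded, while an unrelated state such as {foo.size == 1} survives.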

4.5.5 Validating Formula Construction

To examine whether a path summary φpath can produce φstate, TESTGEN translates this problem into the satisfaction problem of a validating formula F, which is represented as a set of SMT statements acceptable by Yices. To construct F, all method inputs which have been referenced in either φpath or φstate are defined as variables of F. In particular, method inputs which are not related to field or array accesses are modeled as basic variables of F. Method inputs which are fields or arrays are modeled by TESTGEN as uninterpreted functions (a special kind of variable) of F. Any value assignments in the summary effect are modeled as functional update statements in F. Similarly, any value dereferences in φpath or φstate are modeled as functional read statements in F.

TESTGEN constructs F following a specific order. First, the summary path condition is modeled as a set of assertion statements over the method inputs. Then, the summary effect is introduced as a set of functional update statements to the method outputs, represented as uninterpreted functions in F. Finally, the desired object state, φstate, is encoded as assertion statements over the method outputs as well.

After the translation, the satisfaction problem of φpath in producing φstate is equivalent to the satisfaction problem of the constructed validating formula F. Moreover, any satisfiable assignment to F is also a satisfiable assignment to the inputs of the current method for producing φstate. TESTGEN then uses the SMT solver to check the satisfiability of the constructed formula F.
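The construction order above can be sketched as string-level assembly of Yices-style statements. ValidatingFormulaBuilder and its method names are illustrative assumptions; a real emitter would typically drive the solver's API rather than build text:

```java
// Minimal sketch: emit define/assert statements in the order described
// above (input definitions, path condition, effects, desired state).
class ValidatingFormulaBuilder {
    private final StringBuilder out = new StringBuilder();

    // Declare a basic variable or uninterpreted function of the formula.
    ValidatingFormulaBuilder define(String name, String type) {
        out.append("(define ").append(name).append(" :: ").append(type).append(")\n");
        return this;
    }

    // Add an assertion (path condition or desired-state constraint).
    ValidatingFormulaBuilder assertThat(String condition) {
        out.append("(assert ").append(condition).append(")\n");
        return this;
    }

    String build() { return out.toString(); }
}
```

For instance, part of the formula in Figure 10 could be assembled as: new ValidatingFormulaBuilder().define("v1", "MKMap").define("v2", "LMap").assertThat("(/= v1 null)").build().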

Figure 10 presents an example of a validating formula. This formula validates the feasibility of method MKMap.MKMap(LMap) in Figure 8 to produce the desired object state S1-1. Lines 01 - 06 define the method inputs in the form of basic variables and uninterpreted functions of the formula. Line 07 models the method path condition as an assertion statement over the method input v1. Line 08 models the summary effect with a functional update statement. It means that the effect of this method path is to assign v2 to v1.map. Finally, Lines 09 - 11 model the desired object state S1-1 using a set of assertion statements over the method outputs.

01 (define v1 :: MKMap)
02 (define v2 :: LMap)
03 (define map :: (-> MKMap LMap))
04 (define size :: (-> LMap int))
05 (define data :: (-> LMap Object[]))
06 (define length :: (-> Object[] int))
07 (assert (/= v1 null))
08 (define map@2 :: (-> MKMap LMap) /* effect */
       (update map v1 v2))
09 (assert (= (typeOf (map@2 v1)) LMap))
10 (assert (= (size (map@2 v1)) 1))
11 (assert (= (length (data (map@2 v1))) 1))

Fig. 10. A validating formula example

5 EXPERIMENTAL EVALUATION

5.1 Experimental Setup

The objective of the evaluation is to investigate the following research questions:

RQ1 Exploitability: How many crashes can STAR exploit based on crash stack traces?

RQ2 Usefulness: How many of the exploitations from RQ1 are useful crash reproductions for debugging?

5.1.1 Subjects

We used three open source projects as the subjects in this evaluation: apache-commons-collections [2] (ACC), apache-ant [1] (ANT) and apache-log4j [3] (LOG). ACC is a data container library that implements additional data structures over the JDK. ANT is a Java build tool that supports a number of built-in and extension tasks such as compiling, testing and running Java applications. LOG is a library that improves runtime logging for Java programs. All three subjects are real-world medium-size projects with 25,000, 100,000 and 20,000 lines of code respectively.

ACC, ANT and LOG were chosen because their bug tracking systems (i.e., Bugzilla and JIRA) were well maintained and many bug reports contained crash stack traces along with detailed bug descriptions. In addition, these subjects have also been commonly used in the literature [22], [25], [32], [46].


All experiments were run on an Intel Core i7 machine running Windows 7 and Java 6 with 1.5GB of maximum heap space.

5.1.2 Bug Report Collection and Processing

We extracted bug report1 information, such as the ids of the bugs, the affected program versions, and their corresponding crash stack traces.

In our experiments, we used all bug reports except the following: (1) unfixed bug reports, because we could not determine whether they were real bugs; (2) bug reports which did not contain any crash stack traces or provide sufficient information (e.g., crash triggering inputs) for manual reproduction, because STAR requires stack traces to reproduce crashes; (3) bug reports with invalid crash stack traces. To examine validity, we checked whether the methods in the crash stack traces actually exist at the specified lines in the program. The main cause of invalid crash stack traces is incorrect version numbers reported by the users; (4) finally, STAR currently accepts crashes caused by three types of exceptions: a) explicitly thrown exceptions, b) NullPointerException and c) ArrayIndexOutOfBoundsException. We chose these three types of exceptions because they are the most common types of exceptions [41] in Java.

We have collected 12 bug reports from ACC, 21 from ANT and 19 from LOG for the experiments. Table 2 presents statistics of the collected bug reports. In the table, columns 'Versions', 'Avg. Fix Time' and 'Bug Report Period' list the versions, the average bug fixing time and the bug creation dates for the collected bug reports. Columns 'Method Size' and 'Method Size (All)' present the average size of the crashed methods and the average size of all methods in the crash stack traces for the bug reports. The full list of bug reports used in this paper is available at https://sites.google.com/site/starcrashstack/.

TABLE 2
Bug reports collected from ACC, ANT and LOG

Name  Versions         Method Size  Method Size (All)  Avg. Fix Time  Bug Report Period
ACC   2.0 - 4.0        14.9         13.6               42 days        Oct 03 - Jun 12
ANT   1.6.1 - 1.8.3    30.4         36.9               25 days        Apr 04 - Aug 12
LOG   1.0.0 - 1.2.16   16.7         14.0               77 days        Jan 01 - Oct 09

We applied STAR to these collected crashes and tried to generate test cases to reproduce them. During our experiments, we set the maximum loop unrolling to 1, and the maximum depth of invocation to 10 for the crash precondition computation process. We also set the maximum computation time for this process to 100 seconds. In our experiments, further increasing these maximum values did not bring noticeable improvement to the experiment results. As we will further discuss in Section 6.1, the experiment setting is not the major cause of irreproducible crashes.

1. Bug reports for ACC: https://issues.apache.org/jira/browse/collections/, and bug reports for ANT and LOG: https://issues.apache.org/bugzilla/

5.2 Crash Exploitability

5.2.1 Criterion of Exploitability and Results

The first research question we investigated is crash exploitability. We used the following criterion [12] to determine whether a crash can be exploited by STAR:

Criterion 1. A crash is considered exploitable if the test case generated by STAR can trigger the same type of exception at the same crash line.
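Criterion 1 can be checked mechanically by comparing the exception type and the topmost stack frame of the reproduced crash against the original. The ReproCheck helper below is a sketch of that check, not part of STAR; it assumes both throwables carry a non-empty stack trace:

```java
// Sketch of Criterion 1: same exception type, and the same crash location
// (class, method and line number of the topmost stack frame).
class ReproCheck {
    static boolean sameCrash(Throwable original, Throwable reproduced) {
        if (!original.getClass().equals(reproduced.getClass())) return false;
        StackTraceElement o = original.getStackTrace()[0]; // assumes non-empty traces
        StackTraceElement r = reproduced.getStackTrace()[0];
        return o.getClassName().equals(r.getClassName())
                && o.getMethodName().equals(r.getMethodName())
                && o.getLineNumber() == r.getLineNumber();
    }
}
```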

TABLE 3
Overall Exploitation Result

Subject  # of Crashes  # Exploitable Crashes  Exploitation Rate
ACC      12            8                      66.7%
ANT      21            12                     57.1%
LOG      19            11                     57.9%
Total    52            31                     59.6%

Table 3 presents the overall crash exploitation result of STAR. For the three subjects, STAR successfully exploited 8 crashes out of 12 for ACC, 12 out of 21 for ANT, and 11 out of 19 for LOG. Overall, STAR exploited 31 (59.6%) out of the 52 crashes.

5.2.2 Exploitation Level

Original crash stack traces collected from bug reports may have multiple stack frames as shown in Figure 11. Reproducing (exploiting) a crash from higher frame levels may increase the chance of revealing the bug. This is because if a test case can exploit the crash from frame levels higher than the frame level in which the bug lies, developers can follow its execution in a debugger and reveal how buggy behaviors are introduced. However, a crash may not always be exploited up to the highest stack frame level because of the object creation challenge [60]. In this section, we investigate the exploitation levels of crashes.

java.lang.NullPointerException
at org.apache.log4j.helpers.Transform.appendEscapingCDATA(Transform.java:74)
at org.apache.log4j.xml.XMLLayout.format(XMLLayout.java:115)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:292)
at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:225)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:150) ...

Fig. 11. An example of crash stack frames

Table 4 lists all exploitable crashes with their detailed exploitation results. Column 'Prec / Exploit / Total' in the table shows 1) the highest stack frame level for which STAR successfully computed crash triggering preconditions (i.e., Prec), 2) the highest stack frame level for which STAR successfully generated a crash exploiting test case (i.e., Exploit), and 3) the total number of stack frames in the original crash stack trace (i.e., Total). The 'Time' column shows the total time for computing crash triggering preconditions and generating crash exploiting test cases. For example, the total number of


TABLE 4
Detailed Crash Exploitation Results

Bug id      Prec / Exploit / Total   Buggy    Usefulness  Time (s)
ACC-4       Frame 1 / 1 / 1          Frame 1  Useful      3.4
ACC-28      Frame 1 / 1 / 1          Frame 1  Useful      2.0
ACC-35      Frame 3 / 3 / 3          Frame 2  Useful      1.6
ACC-48      Frame 6 / 6 / 6          Frame 5  Useful      2.3
ACC-53      Frame 1 / 1 / 1          Frame 1  Useful      4.2
ACC-77      Frame 2 / 2 / 2          Frame 2  Not Useful  17.2
ACC-104     Frame 1 / 1 / 1          Frame 1  Useful      4.6
ACC-411     Frame 3 / 3 / 3          Frame 3  Useful      13.8
ANT-33446   Frame 3 / 1 / 9          Frame 4  Not Useful  2.6
ANT-36733   Frame 2 / 2 / 6          Frame 4  Not Useful  1.4
ANT-38458   Frame 2 / 1 / 2          Frame 1  Useful      10.9
ANT-38622   Frame 4 / 1 / 5          Frame 1  Useful      17.3
ANT-41422   Frame 2 / 2 / 2          Frame 2  Useful      20.2
ANT-43292   Frame 2 / 2 / 2          Frame 1  Useful      48.5
ANT-44689   Frame 7 / 7 / 14         N/A      Not Useful  34.0
ANT-44790   Frame 3 / 3 / 4          Frame 1  Useful      11.6
ANT-49137   Frame 2 / 1 / 4          Frame 4  Not Useful  14.4
ANT-49755   Frame 3 / 3 / 4          Frame 2  Useful      14.9
ANT-49803   Frame 4 / 4 / 4          Frame 4  Useful      13.3
ANT-50894   Frame 6 / 4 / 14         N/A      Not Useful  21.9
LOG-29      Frame 2 / 2 / 2          Frame 1  Useful      5.1
LOG-11570   Frame 10 / 10 / 10       Frame 1  Useful      9.4
LOG-31003   Frame 1 / 1 / 1          Frame 1  Useful      3.6
LOG-40159   Frame 2 / 2 / 2          Frame 1  Useful      2.3
LOG-40212   Frame 2 / 2 / 2          Frame 1  Not Useful  1.9
LOG-41186   Frame 12 / 12 / 13       Frame 3  Useful      22.3
LOG-45335   Frame 1 / 1 / 1          Frame 1  Not Useful  1.4
LOG-46271   Frame 6 / 6 / 6          Frame 1  Useful      20.5
LOG-47547   Frame 1 / 1 / 1          Frame 1  Useful      9.3
LOG-47912   Frame 3 / 3 / 5          Frame 4  Not Useful  0.7
LOG-47957   Frame 6 / 6 / 6          Frame 1  Useful      41.5

crash stack frames extracted from bug report ANT-50894 is 14. STAR successfully computed crash triggering preconditions for this crash for up to stack frame level 6. STAR exploited this crash from stack frame level 4. The total time spent on computing the crash triggering precondition and generating the crash exploiting test case was 21.9 seconds.

Table 4 shows that 20 (64.5%) out of the 31 crashes were fully exploited (i.e., Exploit = Total). Of the 11 crashes that were not fully exploited, STAR was able to exploit 7 crashes to multiple frame levels. Overall, 27 (87.1%) of the 31 exploited crashes were fully exploited or exploited to multiple frame levels.

In terms of time spent, STAR could exploit crashes within a reasonable time. For 16 out of the 31 exploited crashes, STAR was able to exploit them in less than 10 seconds. On average, each crash was exploited in 12.2 seconds.

5.2.3 Test Case Generation

As shown in Table 4, STAR's test case generation approach is very effective for generating crash exploiting test cases from crash triggering preconditions. For 26 (83.9%) out of the 31 exploited cases, STAR was able to generate crash exploiting test cases for up to the same levels as the crash triggering preconditions (i.e., Prec = Exploit). This means that once a crash triggering precondition is computed, STAR's test case generation approach can effectively compose the necessary method sequences for generating test inputs that exploit the crash.

Table 5 lists further detailed statistics for the test case generation process. Column 'Avg. Objects' shows the average number of input objects required to exploit crashes for the three subjects. Column 'Avg. Candidates' shows the

TABLE 5
Statistics of Test Case Generation

Subject  Avg. Objects  Avg. Candidates  Avg. (Min - Max) Sequence
ACC      1.5           35.5             9.4 (2 - 19)
ANT      1.4           11.7             6.2 (2 - 14)
LOG      1.5           21.8             8.1 (2 - 17)
Total    1.5           21.4             7.7 (2 - 19)

average number of method candidates considered during the test case generation. Column 'Avg. (Min - Max) Sequence' shows the average, minimum and maximum lengths of the generated crash exploiting sequences. Overall, each crash required an average of 1.5 input objects to be created. During the method sequence generation process, an average of 21.4 method candidates were considered. Finally, the average length of the generated method sequences was 7.7 lines (irrelevant lines such as method headers and import declarations were all excluded).

5.2.4 Optimizations

To investigate the effectiveness of STAR's optimization approaches (Section 3.2), we compared the crash exploitability of STAR with and without the optimizations.

Table 6 presents the detailed results. Columns 'Lv' in the table show the highest stack frame levels for which STAR successfully generated crash exploiting test cases. In total, STAR's optimizations improved the highest crash exploiting stack frame levels for 16 cases compared to STAR without optimizations. For cases such as LOG-41186 and LOG-46271, the optimizations significantly improved the highest crash exploiting stack frame levels, by 5 and 6 levels respectively.

Columns 'Time' present the crash precondition computation time (in seconds) for the highest stack frame levels. For example, for ANT-44689, STAR with optimizations took 4.6 seconds to compute the crash triggering preconditions for stack frame level 7. However, without optimizations, STAR was not able to compute any crash triggering precondition for level 7 within the maximum time limit (denoted as 100 in Table 6). For each crash, we compared the computation time for the same (highest) stack frame level only, because comparing the computation time for different stack frame levels is meaningless. In total, STAR's optimizations reduced the crash precondition computation time for 27 cases. Among them, the 22 cases highlighted in bold achieved significant improvements.

After comparing the time spent, we further studied three important numbers during the computation: the number of paths considered, the number of SMT solver checks performed, and the number of conditional blocks encountered. Since these three numbers are closely related to the scalability of the symbolic execution, they are useful indicators of the effectiveness of STAR's optimization approaches. Columns 'Path', 'Check' and 'Block' list the three numbers with and without optimizations. In total, for 25 out of the 31 cases, at least one of the three numbers (for 19 cases, all three numbers) has been significantly reduced (highlighted in bold in the table) after applying STAR's optimizations. On average, the number of paths considered, the number of SMT solver checks


TABLE 6
STAR with / without Optimizations

Bug id     | Without: Lv Path Check Block Field Array Call Time | With: Lv Path Check Block Field Array Call Prune Back Recog Time
ACC-4      | 1 32 32 43 356 8 222 3.6          | 1 10 16 6 63 3 39 3.3 3.0 2.3 2.0
ACC-28     | 1 12 12 8 55 3 12 0.7             | 1 10 13 6 45 3 10 0.6 0.5 0.6 0.6
ACC-35     | 3 2 2 0 6 0 6 0.4                 | 3 2 2 0 6 0 6 0.4 0.4 0.4 0.3
ACC-48     | 6 2 2 0 4 0 10 0.7                | 6 2 2 0 4 0 10 0.7 0.7 0.7 0.7
ACC-53     | 1 18 18 18 230 52 0 1.2           | 1 11 18 7 119 31 0 1.3 0.8 1.3 1.0
ACC-77     | 2 2654 2654 2171 9448 216 5762 39.7 | 2 10 20 10 54 0 36 2.2 20.6 3.4 2.2
ACC-104    | 1 18 18 16 230 34 88 1.8          | 1 11 18 6 119 20 49 1.8 1.2 1.6 1.3
ACC-411    | 2 533 533 1007 11609 508 7241 100 | 3 45 104 93 550 63 347 84.4 100 100 8.4
ANT-33446  | 1 544 544 1067 4500 2383 3789 100 | 3 20 37 28 44 2 97 100 100 100 2.4
ANT-36733  | 2 2 2 0 1 0 4 0.3                 | 2 2 2 0 1 0 4 0.3 0.3 0.3 0.3
ANT-38458  | 1 848 848 586 11659 1316 11511 100 | 2 11 40 21 46 0 61 5.9 100 100 5.9
ANT-38622  | 2 1793 1793 1472 9399 2891 4358 100 | 4 3 11 18 77 2 61 100 5.0 100 6.2
ANT-41422  | 2 648 648 654 5576 24 5712 84.2   | 2 2 7 5 16 0 23 2.8 73.8 41.0 2.8
ANT-43292  | 0 689 689 525 7943 322 7381 100   | 2 11 33 18 22 0 25 100 100 100 4.3
ANT-44689  | 3 468 468 524 4894 22 6765 100    | 7 9 26 21 149 5 115 100 100 100 4.6
ANT-44790  | 2 1306 1306 1046 13185 196 4906 100 | 3 5 18 12 40 0 20 100 100 100 1.7
ANT-49137  | 1 1292 1292 207 15078 11661 4786 100 | 2 211 226 278 2843 1667 1217 100 18.9 100 13.3
ANT-49755  | 1 318 318 639 4217 1027 3125 100  | 3 14 39 41 115 4 64 100 100 100 3.0
ANT-49803  | 3 4242 4242 2394 20707 2177 19461 100 | 4 15 23 13 69 3 71 1.3 100 100 1.3
ANT-50894  | 4 2 2 705 9337 851 6424 100       | 6 28 52 40 191 73 121 100 100 100 13.2
LOG-29     | 1 27 27 4275 19487 957 21857 100  | 2 11 18 13 58 0 25 1.3 100 100 1.0
LOG-11570  | 5 2032 2032 2697 10382 68 5065 100 | 10 5 18 14 63 1 58 100 100 100 2.2
LOG-31003  | 1 13 13 1 20 0 27 2.4             | 1 5 5 0 3 0 7 0.6 2.5 2.5 0.6
LOG-40159  | 2 1 1 2 21 1 17 0.4               | 2 1 2 2 21 1 17 0.4 0.4 0.5 0.4
LOG-40212  | 2 2 2 778 21227 686 14313 100     | 2 2 2 0 2 0 4 0.6 100 100 0.6
LOG-41186  | 7 693 693 1618 9577 724 5497 100  | 12 28 61 37 179 20 120 100 100 25.8 4.5
LOG-45335  | 1 1 1 6 30 2 27 0.4               | 1 1 2 0 1 0 3 0.3 0.4 0.5 0.3
LOG-46271  | 0 1329 1329 1221 5155 48 2450 100 | 6 37 86 56 202 3 110 6.0 20.6 100 5.7
LOG-47547  | 1 36 36 15 204 0 252 2.2          | 1 14 18 6 74 0 86 2.1 2.1 0.9 1.0
LOG-47912  | 3 2 2 0 6 0 7 0.3                 | 3 2 2 0 5 0 7 0.3 0.3 0.3 0.3
LOG-47957  | 0 1939 1939 2566 11309 73 5478 100 | 6 39 94 63 236 7 136 100 100 100 9.5
Average    | 2.0 693.5 693.5 847.1 6640.4 846.8 4727.5 59.3 | 3.4 18.6 32.7 26.3 174.7 61.5 95.1 39.2 50.0 54.3 3.3

performed and the number of conditional blocks encountered have been significantly reduced by 674.9 (97.3%), 660.8 (95.3%) and 820.8 (96.9%), respectively.

We also present, in columns 'Field', 'Array' and 'Call', the number of field accesses, array accesses and method invocations during symbolic execution with and without STAR's optimizations. The results in these columns show similarly significant improvements after applying the optimizations.

To investigate the improvement contributed by each individual optimization, we further studied the performance of crash precondition computation using only one optimization at a time. Columns 'Prune', 'Back' and 'Recog' list the detailed results. The numbers in these three columns correspond to the time spent (in seconds) when allowing only static path pruning, guided backtracking or intermediate contradiction recognition, respectively. In general, static path pruning was the most effective optimization, followed by guided backtracking and intermediate contradiction recognition. The three optimizations achieved significant performance improvements (highlighted in bold in the table) for 9, 4 and 4 cases (13 distinct cases), respectively. However, the combination of the three optimizations (i.e., the 'Time' column) achieved much greater improvements (22 distinct cases) than any of them individually.

Overall, Table 6 demonstrates the effectiveness of our optimizations in computing crash preconditions and creating crash exploiting test cases at higher stack frame levels.

5.3 Usefulness

5.3.1 Criterion of Useful Reproduction

Not every crash exploitation by STAR can be considered a useful reproduction of the original crash for debugging. For example, suppose a crashed method was purposely designed not to take any null value as a parameter. In such a case, exploiting the crash by passing a null value to this method is not useful for revealing the underlying crash triggering bug. The more important question is: where did the crash triggering null value come from? Therefore, a crash exploitation would only be considered a useful reproduction if it can reveal the origin of the null value passed to this method.

Accordingly, we study how many crash exploitations by STAR are useful reproductions of the original crashes for debugging, using the following criterion:

Criterion 2. A crash exploitation is considered to be a useful reproduction if it can reveal the actual bug that causes the original crash.

We manually examined each crash exploitation from Section 5.2 to investigate whether it can reveal the crash triggering bug. To reveal a bug, the exploited crash stack trace should include the buggy frame (i.e., the stack trace frame in which the buggy method lies). To determine the buggy frame level for each crash, we carefully inspected the actual patch linked to the corresponding bug report and identified the bug fixing locations. For example, the buggy frame level for ACC-35 is 2, and STAR exploited this crash from stack frame level 3.

In addition, to be a useful reproduction of the original crash, the crash exploiting test case generated by STAR should construct test inputs that trigger buggy behaviors (e.g., crashing immediately or generating corrupt values that eventually cause a crash) in the buggy location. Otherwise, the exploitation by this test case would not be considered a useful reproduction. For example, for LOG-40212, although STAR could exploit the crash from stack frame level 2, which includes the buggy


frame, it was not considered a useful reproduction because it did not demonstrate how the incorrect null value could be constructed for the buggy method.

5.3.2 Result

Columns 'Buggy' and 'Usefulness' of Table 4 present the investigation results. Column 'Buggy' shows the stack frame level in which the corresponding buggy method lies. Column 'Usefulness' presents our manual investigation results: we inspected each crash exploitation and marked it as Useful or Not Useful based on Criterion 2.

In total, 22 (42.3% of all 52 crashes) out of the 31 crash exploitations were evaluated as Useful reproductions, since they could reveal the bugs that caused the original crashes. For ANT-44689 and ANT-50894, the bug locations were not in any of the methods on their crash stacks, so they were marked as N/A in the 'Buggy' column and their exploitations were identified as Not Useful.

Different crashes might be caused by the same bug. Therefore, we also investigated how the useful crash reproductions by STAR were distributed across distinct bugs. We manually examined all useful crash reproductions and their corresponding bug fixes. Overall, 21 of the 22 useful crash reproductions were caused by distinct bugs. LOG-47957 was the only exception, as it shared the same bug fix with LOG-46271. However, since LOG-47957 and LOG-46271 were reported from different program versions, their crash reproducing test cases were still different from each other.

5.3.3 Reproduction Examples

To demonstrate the capability of STAR in generating useful crash reproductions, we present several examples.

Example 1 (ACC-411). Figure 12 (a) shows the buggy method ListOrderedMap.putAll in ACC-411. The source code has been slightly modified for better readability. This method is buggy because the index variable at line 246 is always incremented after the invocation of the put method. However, by design, index should not be incremented if put only replaced an existing entry in the map instead of adding a new one. The incorrect increment of index can cause an undesired IndexOutOfBoundsException in certain situations, as reported in ACC-411.

To fix this bug, the project developers applied the fix shown in Figure 12 (b). Essentially, this fix tries to determine whether put has replaced an existing entry by checking whether its return value is null. Unfortunately, this fix is not correct, as it does not consider the case where the value of the replaced entry could itself be null. According to the Java specification, the put method of any Map class should return the value of the replaced entry, or null if no entry was replaced. Therefore, when an entry with a null value is replaced, put still returns null, causing index to be incorrectly incremented in the fixed version.
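This return-value subtlety can be checked directly against java.util.HashMap. The following minimal sketch (our own illustration, not part of STAR's output) shows that put returns null both when no mapping existed and when the replaced value was itself null:

```java
import java.util.HashMap;
import java.util.Map;

public class PutReturnValueDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();

        // No previous mapping for "k": put returns null.
        System.out.println(map.put("k", null));  // prints: null

        // An entry IS replaced here, but its old value was null,
        // so put still returns null -- exactly the case the first
        // fix in Figure 12 (b) fails to distinguish.
        System.out.println(map.put("k", "v"));   // prints: null
    }
}
```

In both calls the return value is null, so a null check cannot tell "nothing replaced" apart from "a null-valued entry replaced".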

By applying STAR, a test case was automatically generated, as shown in Figure 12 (c). The generated test case can trigger

2. https://issues.apache.org/jira/browse/collections-411
3. The complete source code for ListOrderedMap is available at http://www.cse.ust.hk/~ning/star/ACC411-ListOrderedMap.java.

243 public void putAll(int index, Map map) {
244   for (Map.Entry entry : map.entrySet()) {
245     put(index, entry.getKey(), entry.getValue());
246     ++index; // buggy increment
247   }
248 }

(a) A buggy method in ACC 4.0-r1351903: the index variable at line 246 might be incorrectly increased, which could lead to an IndexOutOfBoundsException.

243 public void putAll(int index, Map map) {
244   for (Map.Entry entry : map.entrySet()) {
245     V old = put(index, entry.getKey(), entry.getValue());
246     if (old == null) {
247       // if no key was replaced, increment the index
248       index++;
249     } else {
250       // otherwise put the next item after the ...
251       index = indexOf(entry.getKey()) + 1;
252     }
253   }
254 }

(b) The first (incorrect) fix applied by the project developers.

01 public void test0() throws Throwable {
02   Object key1 = new Object();
03   Object key2 = new Object();
04   HashMap map = new HashMap();
05   map.put(key1, null);
06   map.put(key2, null);
07   ListOrderedMap listMap = new ListOrderedMap();
08   listMap.put(key1, null);
09   listMap.put(key2, null);
10   listMap.putAll(2, map);
11 }

(c) Automatically generated test case by STAR, which can trigger undesired crashes in both the original and the first-fix versions of the code.

244 public void putAll(int index, Map map) {
245   for (Map.Entry entry : map.entrySet()) {
246     K key = entry.getKey();
247     boolean contains = containsKey(key);
250     put(index, entry.getKey(), entry.getValue());
251     if (!contains) {
252       // if no key was replaced, increment the index
253       index++;
254     } else {
255       // otherwise put the next item after the ...
256       index = indexOf(entry.getKey()) + 1;
257     }
258   }
259 }

(d) The final fix applied by the project developers.

Fig. 12. A buggy method from ACC and a test case generated by STAR.

an undesired IndexOutOfBoundsException in both the original and the fixed program. We reported this bug to the developers along with the test case generated by STAR. The developers quickly confirmed and fixed the reported bug with the help of the submitted test case. Figure 12 (d) presents the final fixed code, which uses the containsKey method to check for an existing entry with the same key as the input entry before performing put. To help test future releases of the project, the developers added the crash reproducing test case generated by STAR to the official test suite.

This example demonstrates that STAR can not only help the debugging of reported software crashes, but also help discover incomplete or incorrect bug fixes by the developers.

4. https://issues.apache.org/jira/browse/collections-474
5. http://svn.apache.org/r1496168


 92 public UnboundedFifoBuffer(int initialSize) {
 96   buffer = new Object[initialSize + 1];
 97   head = tail = 0;
 99 }

168 public boolean add(Object obj) {
195   if (++tail >= buffer.length)
196     tail = 0;
199 }

221 public Object remove() {
231   head += 1;
238 }

273 public Iterator iterator() {
274   return new Iterator() {
276     private int index = head;
277     private int lastReturnedIndex = -1;

284     public Object next() {
288       lastReturnedIndex = index++;
291     }

293     public void remove() { ...
294       if (lastReturnedIndex == -1) { ... }
299       if (lastReturnedIndex == head) { ... }
306       int i = lastReturnedIndex + 1;
307       while (i != tail) {
308         if (i >= buffer.length) {
309           buffer[i - 1] = buffer[0];
310           i = 0;
311         } else {
312           buffer[i - 1] = buffer[i]; // AIOBE
313           i++;
314         }
315       } ...
321     }
323   };
324 }

(a) Buggy methods in ACC 3.1: the array access operation at line 312 could raise an ArrayIndexOutOfBoundsException when i is 0.

01 public void test0() throws Throwable {
02   UnboundedFifoBuffer v1 = new UnboundedFifoBuffer(3);
03   v1.add(new java.lang.Object());
04   v1.add(new java.lang.Object());
05   v1.add(new java.lang.Object());
06   v1.remove();
07   v1.add(new java.lang.Object());
08   v1.remove();
09   v1.add(new java.lang.Object());
10   java.util.Iterator v2 = v1.iterator();
11   v2.next();
12   v2.next();
13   v2.remove(); // crashed method
14 }

(b) Automatically generated test case by STAR.

Fig. 13. Buggy methods from ACC and a test case generated by STAR.

Example 2 (ACC-53). Figure 13 (a) shows the methods related to bug ACC-53. The source code has been slightly modified to show only the code snippets related to this bug. The while loop from lines 307 to 314 in method UnboundedFifoBuffer.iterator().remove() is buggy and could cause an ArrayIndexOutOfBoundsException at line 312 when i is 0. For example, after executing lines 308 to 310, variable i is set to 0. At this point, if field tail is not 0 and buffer.length is greater than 0, the while loop continues and line 312 is executed. Because i was set to 0 in the previous loop iteration, an ArrayIndexOutOfBoundsException is raised at this line.

Figure 13 (b) presents the test case generated by STAR,

6. https://issues.apache.org/jira/browse/collections-53
7. The complete source code for UnboundedFifoBuffer is available at http://www.cse.ust.hk/~ning/star/ACC53-UnboundedFifoBuffer.java

which exploits the crash in (a). The crash exploitation by this test case can be considered a useful reproduction of the original crash for debugging because it contains the detailed method invocation steps to re-create the object state that causes the program to crash. After executing line 12 of the test case, the object state for v2 is: {head == 2, tail == 1, buffer.length == 4, lastReturnedIndex == 3}. This object state satisfies the precondition for triggering the ArrayIndexOutOfBoundsException in the remove method at line 13. By following this method sequence, developers can easily find that the code from lines 307 to 314 is buggy, as it does not consider the case of entering the loop after i has been reset to 0. Since the generated test case clearly demonstrates this crashing scenario, developers can quickly determine the bug, and once its cause is understood, it is easy to fix.
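The index arithmetic behind this corner case can be traced with a small standalone sketch (our own simulation of the loop at lines 306 to 314, using the object state quoted above; the UnboundedFifoBuffer class itself is not required):

```java
public class Acc53LoopSketch {
    public static void main(String[] args) {
        // Object state for v2 after line 12 of the test case (from the text):
        int tail = 1, bufferLength = 4, lastReturnedIndex = 3;

        int i = lastReturnedIndex + 1;  // line 306: i = 4
        // First iteration of "while (i != tail)": i >= buffer.length,
        // so lines 309-310 copy buffer[3] = buffer[0] and reset i to 0.
        if (i != tail && i >= bufferLength) {
            i = 0;
        }
        // Second iteration: i (0) != tail (1) and i < buffer.length,
        // so line 312 would execute buffer[i - 1] = buffer[i],
        // i.e., an access to buffer[-1] -> ArrayIndexOutOfBoundsException.
        System.out.println("line 312 would access index " + (i - 1));
    }
}
```

The simulation confirms that, under the object state created by the test case, the else branch runs with i == 0 and the write to buffer[i - 1] falls off the front of the array.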

Even though the total number of stack frame levels for this crash is only one, the crash is not easy to reproduce, as it only occurs in certain corner situations. In fact, the test case generated by STAR is the minimal test case that can reproduce the crash and reveal the crash triggering bug.

To further confirm the usefulness of this test case, we sent it to the developer who fixed this bug. The developer confirmed the usefulness of this test case with the reply, "The auto-generated test case would reproduce the bug. ... I think that having such a test case would have been useful."

Example 3 (ANT-49755). Figure 14 (a) shows the buggy method FileUtils.createTempFile in ANT-49755. This method is buggy because it does not consider the case where parameter prefix is null. After invoking the File.createTempFile method with a null prefix at line 888, a NullPointerException is raised.

881 public File createTempFile(String prefix, ...) { ...
886   if (createFile)
887     try {
888       result = File.createTempFile(prefix, ...); // NPE
...
906 }

(a) Buggy method in ANT 1.8.1: the File.createTempFile invocation at line 888 could raise a NullPointerException if parameter prefix is null.

01 public void test0() throws Throwable {
02   java.lang.String v1 = new java.lang.String();
03   java.io.File v2 = new java.io.File(null, v1);
04   java.lang.String v3 = String.valueOf(new char[1]);
05   TempFile v4 = new TempFile();
06   v4.setProperty(v3);
07   v4.setDestDir(v2);
08   v4.setCreateFile(true);
09   v4.execute();
10 }

(b) Automatically generated test case by STAR.

Fig. 14. A buggy method from ANT and a test case generated by STAR.

Figure 14 (b) presents the test case generated by STAR, which exploits the crash in (a). This test case first prepares a TempFile object. Then, the crashing method TempFile.execute is invoked, which in turn calls the buggy FileUtils.createTempFile method. The TempFile object is prepared from lines 02 to 08, such that the execution with this object can reach the buggy FileUtils.createTempFile method with the desired null

8. https://issues.apache.org/bugzilla/show_bug.cgi?id=49755


172 public String encode(String value) {
173   StringBuffer sb = new StringBuffer();
174   int len = value.length(); // NPE
...
207 }

(a) A crashed method in ANT 1.6.2: the dereference at line 174 could raise a NullPointerException if parameter value is null.

01 public void test0() throws Throwable {
02   DOMElementWriter v1 = new DOMElementWriter();
03   v1.encode(null);
04 }

(b) Automatically generated test case by STAR.

Fig. 15. An example of a Not Useful test case generated by STAR.

value for parameter prefix. Although STAR exploits this crash only up to stack frame level 3 (the full stack has four frame levels), the crash exploitation by this test case is adequate to reveal the bug because it constructs a concrete scenario that demonstrates the incorrect behavior (i.e., passing a null prefix to File.createTempFile) in FileUtils.createTempFile. Therefore, it is considered a useful reproduction of the original crash.
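The crash condition here bottoms out in the JDK itself: java.io.File.createTempFile is documented to throw a NullPointerException when its prefix argument is null. A minimal standalone check of that behavior (our own illustration, independent of Ant):

```java
import java.io.File;

public class NullPrefixDemo {
    public static void main(String[] args) {
        try {
            // Null prefix, as in the exploited ANT-49755 crash.
            File.createTempFile(null, null);
            System.out.println("no exception");
        } catch (NullPointerException e) {
            // Documented JDK behavior: NPE if prefix is null.
            System.out.println("NullPointerException");
        } catch (Exception e) {
            System.out.println("other: " + e);
        }
    }
}
```

Any caller that forwards an unchecked, possibly-null prefix to File.createTempFile, as FileUtils.createTempFile does at line 888, inherits this NullPointerException.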

Example 4 (ANT-33446). Figure 15 shows a crash exploiting test case generated by STAR for crash ANT-33446. Even though this test case triggers the same exception at the same line as the original crash, it is actually a typical example of illegal method usage. By design contract, the DOMElementWriter.encode method should not be passed a null parameter. Therefore, triggering a NullPointerException in this method by simply passing a null parameter does not reveal the actual bug. In fact, the actual bug that causes the original crash lies in the XmlLogger.buildFinished method at stack frame level four. Therefore, according to Criterion 2, we consider this and similar illegal method usage cases Not Useful because they cannot reveal the actual bugs that cause the original crashes.

5.3.4 Developer Survey

Since our usefulness evaluation may be subjective, we also consulted developers to evaluate the usefulness of STAR. For each exploited crash, we sent a survey email to the developers who had fixed the corresponding bug. The survey email included the crash exploiting test case along with information about the original bug.

We received 6 responses out of 31 (a 19% response rate). While the number of responses is not large enough to be conclusive, among the six responses, three test cases were confirmed as useful by the developers. For the other three cases, the developers showed interest in our system and gave us valuable feedback. For example, one developer confirmed that the test case generated by STAR for ACC-35 is accurate, but noted that the original crash can be easily debugged by just reading the exception message; in this case, a crash exploiting test case does not provide additional help. For ANT-41422, the developer responded that test cases for their project should generally be XML script files rather than JUnit-style test cases. However, the developer confirmed that the test case generated by STAR might provide a hint for constructing their XML script files for debugging. Overall, we received promising

9. https://issues.apache.org/bugzilla/show_bug.cgi?id=33446

responses from developers. The complete developer responses are available at http://www.cse.ust.hk/~ning/star/.

5.4 Comparison

To further evaluate the effectiveness of STAR, we conducted a comparison study between STAR and two different frameworks, Randoop [45] and BugRedux [33].

Randoop is a state-of-the-art feedback-directed random test input generation framework. It can generate thousands of random method sequences to achieve high structural coverage and reveal bugs. We applied Randoop to generate method sequences to reproduce our 52 subject crashes. Randoop was executed with its default settings except for the timeout option and the classlist option. To increase the probability of Randoop generating crash reproducing sequences, we set its timeout option to 1000 seconds, while the default value is only 100 seconds. Further increasing the timeout value did not improve the results. Furthermore, we reduced the search scope of Randoop by providing the set of classes relevant to the target crash via the classlist option. Only the classes that appear in the crash stack trace are considered relevant and provided to Randoop. In this way, Randoop can focus on generating method sequences for only the crash related classes. Essentially, our setting favors Randoop over STAR.

The second framework used in our comparison study is BugRedux [33]. Similar to STAR, BugRedux supports failure reproduction by leveraging symbolic execution and constraint solving techniques. BugRedux has several variants which use different levels of crash information (i.e., the crash location, the crash stack trace, the call sequence and the complete execution trace) to reproduce crashes. Since the call sequence information and the complete execution trace are not available without runtime monitoring of the original program executions, we chose the BugRedux variant that makes use of the crash stack trace information for our comparison study. BugRedux was applied to generate test cases for each level of the crash stack frames using the same settings as STAR.

Figure 16 presents the crash reproduction results achieved by Randoop, BugRedux and STAR for the 52 crashes. The Precondition category shows the number of crashes whose crash triggering preconditions were computed. Since Randoop does not support crash precondition computation, its value is denoted as N/A. The Exploitation category shows the number of crashes successfully exploited by the three frameworks according to Criterion 1. The Usefulness category shows the number of crash exploitations identified as useful reproductions according to Criterion 2.

Overall, STAR outperformed both frameworks in all categories. STAR outperformed Randoop in the crash exploitation and usefulness categories by 19 and 14 more crashes, respectively. The underlying reason is that Randoop uses a randomized approach to construct method sequences. Due to the large search space of real world programs, the probability of a randomized approach generating non-trivial

10. We re-implemented it using Java as it was originally written in C/C++.


Fig. 16. Crash reproduction results achieved by Randoop, BugRedux and STAR (number of crashes):

            Precondition   Exploitation   Usefulness
Randoop     N/A            12             8
BugRedux    18             10             7
STAR        38             31             22

inputs satisfying the desired crash triggering object states is low. However, it is worth noting that, although Randoop is capable of reproducing crashes, it is not a framework specially designed for crash reproduction. Instead, Randoop is a well-engineered test generation framework that can scale to very large programs and has been widely used in practice.

STAR also outperformed BugRedux, in all three categories, by 20, 21 and 15 more crashes, respectively. STAR achieved better results than BugRedux for two main reasons. First, STAR leverages several effective optimizations to improve the efficiency of the crash precondition computation process. Therefore, STAR could successfully compute crash preconditions that were too complex for BugRedux. Second, STAR introduces a novel test input generation approach that can compose complex method sequences for generating crash reproducing input objects. As a result, crashes that required non-trivial input objects to trigger could still be reproduced by STAR.

Overall, the comparison study demonstrates that STAR can effectively outperform Randoop and BugRedux in crash reproduction using stack traces.

6 DISCUSSION

6.1 Major Challenges

STAR successfully exploited 31 crashes, among which 22 were considered useful for revealing the crash triggering bugs. However, there were still 30 crashes that were either not exploited or whose exploitations were not considered useful. To investigate the major challenges in reproducing these crashes, we manually inspected each of them.

Table 7 presents the major challenges we identified for the crashes that were not reproduced. In total, seven major kinds of challenges were identified. The number of occurrences of each challenge for the three subject programs is also listed in the table.

Environment dependency. The most common challenge we identified is the dependency of crashes on the external environment, such as file or network inputs. Crashes that depend on the external environment are difficult to reproduce because STAR currently does not support modification of the external environment. This challenge accounts for 11 (36.7%) of the 30 crashes that could not be reproduced.

TABLE 7
Major Challenges for Crash Reproduction

Challenge               ACC  ANT  LOG  Total
Environment Dependency   0    8    3   11 (36.7%)
SMT Solver Limitations   2    2    3    7 (23.3%)
Concurrency              1    0    4    5 (16.7%)
Path Explosion           2    0    0    2 (6.7%)
Reflection               0    1    0    1 (3.3%)
Exception Handling       0    1    0    1 (3.3%)
Other                    0    2    1    3 (10.0%)

The environment dependency challenge has also been identified in related studies [55] as the primary reason for low test coverage.

SMT solver limitations. The second most common challenge we identified is the limitations of SMT solvers. During both the crash precondition computation and the test case generation processes, the SMT solver is used to check the satisfiability of different formulae. However, current state-of-the-art SMT solvers such as Yices [28] and Z3 [26] have only limited support for theories like non-linear arithmetic and floating-point arithmetic. Therefore, crashes involving such arithmetic may not be reproduced because of these solver limitations. Moreover, crashes related to string regular expressions are also difficult to reproduce, as they usually introduce complex string constraints that the SMT solver cannot handle. A recent study by Erete et al. [29] also demonstrates the weaknesses of SMT solvers when used in symbolic execution. In total, 7 (23.3%) of the not reproduced crashes fall into this category.

Concurrency and non-determinism. STAR is currently designed to reproduce single-threaded, deterministic crashes. However, many crashes are only reproducible under concurrent or non-deterministic executions. Therefore, extending STAR to support concurrent and non-deterministic crashes is important future work. In total, 5 (16.7%) of the not reproduced crashes fall into this category.

Path explosion. Both the crash precondition computation process and the test case generation process may suffer from the path explosion problem. Although STAR leverages different optimization heuristics to improve the efficiency of both processes, some crashes are still too complex to reproduce. The path explosion problem may be alleviated by adding more computation resources or introducing additional optimization heuristics, but it is difficult to solve completely. In total, 2 (6.7%) of the not reproduced crashes fall into this category.

Other challenges. Besides the common challenges above, there are also challenges that are less common but worth noting. The first is reflection. Crashes related to the reflection mechanism are usually difficult to reproduce because the specific classes or methods triggering these crashes can only be determined at runtime. The second is exception handling, meaning that a crash can only be triggered if a particular exception has been previously raised.


6.2 Future Work

To improve the effectiveness of STAR, we identify several important directions for future work.

The first potential direction is to add the ability to modify the external environment. Currently, many crashes are not reproduced because they depend on the external environment, such as the content of a file or the state of a network connection. To reproduce these crashes, STAR needs to be able to modify the external environment according to the crash triggering preconditions.

Another potential direction is to integrate more specialized constraint solvers to overcome some of the weaknesses of Yices. For example, for crashes involving non-trivial string regular expressions or complex string operations, Yices may not be able to compute a feasible input model for reproducing the crashes. The integration of a more specialized string constraint solver such as HAMPI [34] or Z3-str [64] may therefore greatly improve the reproducibility of STAR for these kinds of crashes.

Currently, STAR supports only three major types of exceptions. To extend STAR's applicability, we plan to add support for more exception types in future versions. In general, it is harder to support exceptions whose initial crash conditions are difficult to infer, such as ClassNotFoundException and InterruptedException. To support these exceptions, STAR needs more information on the global program state and the runtime environment at the time of the crash. In contrast, exceptions like ClassCastException, whose initial crash conditions are relatively easy to infer, can be supported more easily.
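To illustrate what "easy to infer" means here, the initial crash condition for a ClassCastException at a cast (T) o can be stated as a simple predicate over the operand: it is non-null and not an instance of T. A minimal sketch of this idea (our own illustration, not STAR output):

```java
public class CastConditionDemo {
    public static void main(String[] args) {
        Object o = Integer.valueOf(42);

        // Initial crash condition for the cast "(String) o":
        // the operand is non-null and not an instance of the target type.
        boolean crashCondition = (o != null) && !(o instanceof String);
        System.out.println(crashCondition);  // prints: true

        try {
            String s = (String) o;  // crashes, as the predicate predicts
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as predicted");
        }
    }
}
```

By contrast, no comparably local predicate exists for ClassNotFoundException or InterruptedException, whose triggering depends on the classpath or on the actions of other threads.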

Finally, as concurrent programs become more popular, we also plan to extend STAR to support concurrency crashes.

6.3 Threats to Validity

We identify the following threats to the validity of our approach and the evaluation study.

Implementation might have faults. Considering the complexity of the Java semantics, we could have introduced bugs in our implementation. We have carefully reviewed the implementation and tried to eliminate any bugs found in the code. We also welcome users to report any bugs found in STAR.

The subject systems might not be representative. In our evaluation study, we used three open source projects, ACC, ANT and LOG, as subjects because their bug reporting systems have high-quality bug reports containing crash stack traces. We might therefore have a subject selection bias. Also, since they are open source projects, they may not be representative of closed-source projects.

To mitigate the subject selection bias, we selected three subjects with very different functionalities: data containers, a build system and runtime logging. Despite the differences in functionality, all three subjects have a high-quality code base and are widely used in industry. In future studies, we may also evaluate STAR on closed-source projects.

Usefulness evaluation could be subjective. We carefully evaluated the usefulness of STAR by inspecting the generated test cases along with the actual bug patches and consulting developers. However, the evaluation could still be subjective. Therefore, we have put all generated test cases online for examination.

7 RELATED WORK

Many record-replay approaches [12], [23], [43], [50], [52] have been proposed for reproducing crashes. However, record-replay approaches yield high performance overhead because they need to monitor program executions. Therefore, these approaches try to reduce the performance overhead by reducing the amount of data that needs to be recorded during executions. For example, ReCrash [12] performs a light-weight recording of the program state during executions. Record-replay approaches for reproducing concurrency failures [9], [37], [47] try to reduce the overhead by recording only part of the execution trace and thread scheduling. To be able to replay concurrency failures with only partial execution data, they rely on more intelligent execution replayers which employ various techniques such as data race inference [9] or probabilistic replay [47] to guide their execution replay processes. Techniques [61], [62] have also been proposed to use clues from existing application logs instead of execution traces to help diagnose crashes.

In contrast to record-replay approaches, STAR is a purely static approach. Since it does not monitor software executions, no performance overhead is incurred. Besides performance overhead, record-replay approaches need to modify the deployment environment to collect runtime information from the client site to reproduce crashes. STAR, however, does not require any changes to the existing deployment environment. Due to these fundamental differences, we do not compare STAR with record-replay approaches in our experiments.

Other than the record-replay approaches, post-failure-process approaches [20], [33], [38], [49], [63] have also been proposed. Post-failure-process approaches conduct analysis on crashes only after they have occurred. Since they do not record runtime information during executions, they do not incur performance overhead.

Manevich et al. proposed PSE [38], a typical post-failure-process approach which performs postmortem data-flow analysis to explain program failures with minimal information. STAR is different from PSE in two major aspects. First, STAR is able to compute the crash-triggering precondition using a backward symbolic execution. PSE, however, only identifies the trace of the crash-triggering value using a data-flow analysis. Second, PSE only tries to explain a failure by identifying possible error traces, while STAR aims at both identifying the complete crash path and constructing real test cases that can actually reproduce the crash.
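
The backward computation of a crash-triggering precondition can be illustrated with a toy weakest-precondition pass over straight-line assignments (a minimal sketch of the general idea, not STAR's engine; the expression encoding and the example program are invented):

```python
# Expressions: variables are strings; compound terms are tuples such
# as ("+", a, b). The crash condition is an expression tree; wp()
# substitutes the rhs of each assignment for its lhs, walking the
# program backward from the crash point.
def subst(expr, var, replacement):
    if expr == var:
        return replacement
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(subst(e, var, replacement) for e in expr[1:])
    return expr

def wp(program, crash_condition):
    """Weakest precondition of crash_condition over a straight-line
    list of (lhs, rhs) assignments, computed backward."""
    cond = crash_condition
    for lhs, rhs in reversed(program):
        cond = subst(cond, lhs, rhs)
    return cond

# Crash: division by zero when d == 0, after executing in source order
#   n = x * 2
#   d = n - 10
program = [("n", ("*", "x", 2)), ("d", ("-", "n", 10))]
precondition = wp(program, ("==", "d", 0))
# Backward substitution yields ("==", ("-", ("*", "x", 2), 10), 0),
# i.e. the crash is triggered exactly when x == 5.
assert precondition == ("==", ("-", ("*", "x", 2), 10), 0)
```

The resulting formula over the method's inputs is what a constraint solver then turns into a concrete crash-triggering input.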

Besides explaining failures, approaches have also been proposed to actually reproduce crashes using various postmortem program data. Rossler et al. proposed RECORE [49], an execution synthesis approach which analyzes the core dump (i.e., a snapshot of the stack and heap memory) extracted at the time of the crash to generate crash-reproducing test cases. RECORE uses an evolutionary search-based technique to generate test cases that can reproduce the original crashes. Weeratunge et al. also proposed an approach [58] that achieves the reproduction of concurrency failures using core dumps. Leitner et al. proposed Cdd [35], [36], an approach that uses both the core dump and the developer-written contracts in the program code to facilitate auto-generation of crash-reproducing test cases. STAR is different from all these approaches as it does not rely on core dumps to generate crash-reproducing test cases. In reality, only a limited number of applications support the generation and submission of a core dump at the time of a crash. Therefore, approaches relying on core dumps may not be widely applicable. However, when both the core dump and the crash stack trace are available, these approaches and STAR may be applied together for a better chance of reproducing the crash.

Recently, approaches that do not rely on the core dump have also been proposed. Zamfir et al. proposed ESD [63], an execution synthesis approach which automatically synthesizes failure executions using only the stack trace information (though the stack trace information is still extracted from the core dump). Jin et al. proposed BugRedux [33], an approach which aims to reproduce crashes by leveraging different levels of execution information, including the crash stack trace. STAR is similar to these approaches in that all of them try to reproduce crashes in-house with limited information (e.g., the crash stack trace) from bug reports.

STAR is different from ESD and BugRedux in two major aspects. First, both ESD and BugRedux rely on forward symbolic execution to achieve execution synthesis. Since a forward symbolic execution is not demand-driven, it has to explore many paths that are not relevant to the target crash. Instead, STAR uses a backward symbolic execution-based approach for crash precondition computation. Since the backward symbolic execution is demand-driven, STAR needs to explore only paths that are relevant to the crash. Moreover, a backward symbolic execution-based approach can be more effectively optimized by approaches such as static path pruning, because most crash-related information (e.g., the variables contributing to the crash) near the crash location is available soon after the backward execution starts. For these reasons, STAR's crash precondition computation is applicable to real-world complex programs with tens or even hundreds of thousands of lines of code. The second major difference is that neither ESD nor BugRedux handles the object creation challenge [60], which is a major challenge for reproducing object-oriented crashes. STAR introduces a novel test input generation approach that can compose complex method sequences for generating crash-triggering input objects. Therefore, crashes that require non-trivial input objects to trigger can be reproduced by STAR. Our evaluation study (Section 5.4) shows that STAR can outperform BugRedux in both the ability to compute crash-triggering preconditions and the ability to generate useful crash-reproducing test cases.

Generating input objects to satisfy desired object states can be achieved either by directly assigning values to the member fields or by composing method sequences to create and mutate the target objects. Boyapati et al. [17] proposed Korat, an approach to create desired input objects through direct field assignments. However, this kind of approach relies on specifications, such as class invariants written by developers, which are rarely available.

Instead, method sequence generation approaches are more commonly used. JCrasher [24] and Randoop [45] generate random method sequences to achieve high structural coverage. Such approaches can generate large numbers of method sequences. However, due to the large search space of real-world programs, the probability that randomized approaches generate non-trivial input objects satisfying desired object states is low.
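
A toy simulation makes the low probability concrete (the class, target state, and parameters are invented for illustration; this is not an experiment from the paper):

```python
import random

# Toy model of random method-sequence generation (in the spirit of
# JCrasher/Randoop): a Counter object whose crash-triggering state
# requires a specific cumulative effect from its mutator calls.
class Counter:
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
    def dec(self):
        self.value -= 1
    def reset(self):
        self.value = 0

def random_sequence_hits_target(target=7, length=10, trials=2000, seed=0):
    """Fraction of random method sequences that end in the target state."""
    rng = random.Random(seed)
    methods = [Counter.inc, Counter.dec, Counter.reset]
    hits = 0
    for _ in range(trials):
        c = Counter()
        for _ in range(length):
            rng.choice(methods)(c)
        if c.value == target:
            hits += 1
    return hits / trials

# Even for this trivial class, random sequences rarely reach a
# non-trivial state such as value == 7.
rate = random_sequence_hits_target()
assert rate < 0.05
```

For real classes with dozens of methods and structured arguments, the search space is far larger and the hit rate correspondingly lower.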

Thummalapenta et al. proposed Seeker [55], a method sequence generation approach that uses a combination of dynamic symbolic execution and static analysis to synthesize method sequences for achieving target object states. The method sequence composition phase of STAR is similar to Seeker in that both approaches perform backward generation of method sequences from the goal object states to initial primitive values. However, Seeker relies on dynamic symbolic execution [56] to identify feasible paths to cover the intermediate goals during its generation process. Essentially, whenever a candidate sequence is suggested by the static analyzer, Seeker has to perform dynamic symbolic executions on all paths within the suggested candidate sequence until a feasible path is found. In contrast, STAR does not rely on dynamic symbolic execution to achieve method sequence composition. Instead of executing all paths in the candidate sequence both symbolically and dynamically like Seeker, STAR makes use of precise method summary information which only needs to be computed once. The evaluation of a method path by STAR can be achieved simply by performing an SMT check on a validating formula. In general, Seeker is designed and optimized to achieve higher code coverage, while STAR focuses on generating test cases to cover a particular crashing path.
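
The contrast can be sketched with a one-field abstract state: method summaries record a guard and an effect, and backward chaining from the goal state validates each candidate step against its summary, a cheap stand-in for STAR's SMT check on a validating formula (the class, summaries, and goal are invented for illustration):

```python
# Toy summaries over an abstract object state (its 'size' field):
# each method maps a pre-state to a post-state (pre + delta) and is
# applicable only when its guard holds on the pre-state.
SUMMARIES = {
    "add":    (lambda pre: True,    +1),
    "remove": (lambda pre: pre > 0, -1),
}

def compose(goal, max_len=8):
    """Backward chaining from the goal state to the constructor
    (size 0). Each step inverts a summary's effect and checks its
    guard -- the stand-in for an SMT check on the summary formula."""
    seq, state = [], goal
    while state != 0 and len(seq) < max_len:
        for name, (guard, delta) in SUMMARIES.items():
            pre = state - delta
            if pre >= 0 and guard(pre):
                seq.append(name)
                state = pre
                break
        else:
            return None
    if state != 0:
        return None
    return ["new"] + list(reversed(seq))

def replay(seq):
    """Forward replay confirming the sequence reaches the goal state."""
    size = None
    for m in seq:
        if m == "new":
            size = 0
        elif m == "add":
            size += 1
        elif m == "remove":
            assert size > 0
            size -= 1
    return size

# Goal: an object with size == 3 -> new(); add(); add(); add()
assert compose(3) == ["new", "add", "add", "add"]
assert replay(compose(3)) == 3
```

The point of the sketch is that each candidate step is validated once against a precomputed summary, rather than being re-executed symbolically and dynamically as in Seeker.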

Various approaches [16], [48], [53], [57] have also been proposed to improve different aspects of (dynamic) symbolic execution. For example, Boonstoppel et al. proposed RWset [16], an optimization technique for reducing the number of paths traversed by forward symbolic execution frameworks such as [18], [19]. Their technique identifies and discards paths which have the same side-effects as previously traversed paths. STAR's static path pruning approach is similar to RWset in that it also identifies and prunes away paths from non-contributive conditional blocks. However, STAR's approach is different from RWset in several major aspects. First, it is designed to identify and discard non-contributive conditional blocks instead of just individual paths. Second, it leverages crash-related information, such as the exception-triggering variables, to guide the non-contributive block identification and pruning process. Finally, STAR's approach has been designed to work with backward instead of forward symbolic execution. As previously stated, backward symbolic execution is potentially more suitable for the crash precondition computation. Similarly, Taneja et al. proposed eXpress [53], an approach for improving the efficiency of regression test generation for path-exploration-based test generation techniques. STAR's static path pruning approach is also quite different from eXpress: eXpress is designed to identify and prune conditional blocks that cannot reach or infect the newly added statements in regression testing, while STAR prunes conditional blocks whose effects do not contribute to the target crash.
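
The idea of pruning non-contributive conditional blocks can be sketched as a backward relevance pass (a toy model in which blocks are reduced to their read/write sets, far simpler than STAR's actual analysis):

```python
# Toy static path pruning: conditional blocks that do not write any
# (transitively) crash-relevant variable are non-contributive, so the
# symbolic executor need not fork on them.
def prune_blocks(blocks, crash_vars):
    """Keep only conditional blocks whose written variables intersect
    the crash-relevant set, propagating relevance backward through
    the kept blocks' reads."""
    relevant = set(crash_vars)
    kept = []
    for block in reversed(blocks):
        if block["writes"] & relevant:
            kept.append(block)
            relevant |= block["reads"]  # a kept block's inputs become relevant
    return list(reversed(kept))

blocks = [
    {"id": "B1", "writes": {"log"},  "reads": {"msg"}},    # logging only
    {"id": "B2", "writes": {"size"}, "reads": {"input"}},
    {"id": "B3", "writes": {"buf"},  "reads": {"size"}},
]
# Crash dereferences `buf`: only B3 and (transitively) B2 contribute,
# so the logging block B1 is pruned.
kept = prune_blocks(blocks, {"buf"})
assert [b["id"] for b in kept] == ["B2", "B3"]
```

Every pruned block removes a fork point, shrinking the number of paths the backward symbolic execution must consider.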

To improve constraint solving during symbolic execution, Pasareanu et al. proposed an approach [48] to split path conditions into solvable constraints and complex non-linear constraints that cannot be handled by the SMT solvers. Their approach uses the concrete solutions of the solvable constraints to help solve the complex constraints. Visser et al. [57] proposed Green, a framework for storing and reusing previous constraint solutions to speed up the solving of new constraints. Different from these approaches, which focus on improving the capability or efficiency of the constraint solving process, STAR's guided backtracking approach aims to improve the overall traversal of the symbolic execution by leveraging the solution information from the constraint solving process.
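
The reuse idea behind Green can be sketched with a canonicalizing cache in front of a solver (the solver, constraint form, and canonicalization here are minimal stand-ins for illustration):

```python
# Toy constraint-solution cache in the spirit of Green: path
# conditions that canonicalize to the same formula reuse the cached
# model instead of re-invoking the solver.
solver_calls = 0

def slow_solver(constraints):
    """Stand-in for an SMT solver: finds an integer x satisfying a
    conjunction of (lo, hi) interval bounds."""
    global solver_calls
    solver_calls += 1
    for x in range(-100, 101):
        if all(lo <= x <= hi for lo, hi in constraints):
            return x
    return None

def canonical(constraints):
    # Sort and deduplicate conjuncts so syntactically different but
    # equivalent conjunctions hit the same cache entry.
    return tuple(sorted(set(constraints)))

_cache = {}

def solve(constraints):
    key = canonical(constraints)
    if key not in _cache:
        _cache[key] = slow_solver(key)
    return _cache[key]

# Two path conditions that differ only in conjunct order share a model
# and cost only one solver invocation.
m1 = solve([(0, 50), (10, 99)])
m2 = solve([(10, 99), (0, 50)])
assert m1 == m2 and solver_calls == 1
```

Green's actual canonicalization (variable renaming, constraint slicing) is much more aggressive, but the caching structure is the same.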

8 CONCLUSIONS

This paper has presented STAR, an automatic crash reproduction framework using only crash stack traces. Our experiments on 52 real-world crashes from ACC, ANT and LOG showed that STAR successfully exploited 31 (59.6%) of the 52 crashes. Among these exploited crashes, 22 (42.3%) were useful reproductions of the original crashes for debugging, as they could reveal the actual crash-triggering bugs. A comparison study between STAR and two existing approaches also demonstrated that STAR effectively outperforms these approaches in crash reproduction using only stack traces.

STAR can be applied to existing bug reporting systems without incurring additional performance overhead. Automatically reproducing crashes significantly reduces the time and effort spent by developers on crash debugging.

The implementation of STAR and the experiment data are publicly available at http://www.cse.ust.hk/~ning/star/.

REFERENCES

[1] Apache Software Foundation. Apache Ant. http://ant.apache.org/.
[2] Apache Software Foundation. Apache Commons Collections. http://commons.apache.org/collections/.
[3] Apache Software Foundation. Apache log4j. http://logging.apache.org/log4j/.
[4] Atlassian Inc. JIRA. http://www.atlassian.com/software/jira/.
[5] Google Inc. Breakpad. http://code.google.com/p/google-breakpad/.
[6] Mozilla Foundation. Talkback. http://talkback.mozilla.org.
[7] Crashreporter. Technical Report TN2123, Apple Inc., Cupertino, CA, 2004.
[8] H. Agrawal and J. R. Horgan. Dynamic program slicing. In Proc. PLDI, pages 246–256, New York, NY, USA, 1990. ACM.
[9] G. Altekar and I. Stoica. ODR: output-deterministic replay for multicore debugging. In Proc. SOSP, pages 193–206, New York, NY, USA, 2009. ACM.

[10] S. Anand, P. Godefroid, and N. Tillmann. Demand-driven compositional symbolic execution. In Proc. TACAS, pages 367–381, Berlin, Heidelberg, 2008.
[11] L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, University of Copenhagen, DIKU, 1994.
[12] S. Artzi, S. Kim, and M. D. Ernst. ReCrash: Making Software Failures Reproducible by Preserving Object States. In Proc. ECOOP, pages 542–565, 2008.
[13] D. Babic and A. J. Hu. Calysto: Scalable and precise extended static checking. In Proc. ICSE, pages 211–220, New York, NY, USA, 2008. ACM.
[14] K. Bartz, J. W. Stokes, J. C. Platt, R. Kivett, D. Grant, S. Calinoiu, and G. Loihle. Finding similar failures using callstack similarity. In Proc. SysML, pages 1–1, Berkeley, CA, USA, 2008. USENIX Association.
[15] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann. What makes a good bug report? In Proc. FSE, pages 308–318, New York, NY, USA, 2008. ACM.
[16] P. Boonstoppel, C. Cadar, and D. R. Engler. RWset: Attacking path explosion in constraint-based test generation. In Proc. TACAS, pages 351–366, 2008.
[17] C. Boyapati, S. Khurshid, and D. Marinov. Korat: automated testing based on Java predicates. In Proc. ISSTA, pages 123–133, 2002.
[18] C. Cadar, D. Dunbar, and D. Engler. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proc. OSDI, pages 209–224, Berkeley, CA, USA, 2008. USENIX Association.
[19] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: automatically generating inputs of death. In Proc. 13th ACM Conference on Computer and Communications Security, pages 322–335, 2006.
[20] S. Chandra, S. J. Fink, and M. Sridharan. Snugglebug: A powerful approach to weakest preconditions. In Proc. PLDI, pages 363–374, New York, NY, USA, 2009. ACM.
[21] N. Chen and S. Kim. Puzzle-based automatic testing: Bringing humans into the loop by solving puzzles. In Proc. ASE, pages 140–149, New York, NY, USA, 2012. ACM.
[22] H. Cibulski and A. Yehudai. Regression test selection techniques for test-driven development. In Proc. ICSTW, pages 115–124, March 2011.
[23] J. Clause and A. Orso. A Technique for Enabling and Supporting Debugging of Field Failures. In Proc. ICSE, pages 261–270, Minneapolis, Minnesota, May 2007.
[24] C. Csallner and Y. Smaragdakis. JCrasher: an automatic robustness tester for Java. Software: Practice and Experience, 34:1025–1050, 2004.
[25] B. Daniel and M. Boshernitsan. Predicting effectiveness of automatic testing tools. In Proc. ASE, pages 363–366, Washington, DC, USA, 2008.
[26] L. M. de Moura and N. Bjørner. Z3: An efficient SMT solver. In Proc. TACAS, pages 337–340, 2008.
[27] E. W. Dijkstra. A Discipline of Programming. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1997.
[28] B. Dutertre and L. de Moura. System description: Yices 1.0. In Proc. SMT-COMP, 2006.
[29] I. Erete and A. Orso. Optimizing constraint solving to better support symbolic execution. In Proc. CSTVA, pages 310–315, March 2011.
[30] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle, and G. Hunt. Debugging in the (very) large: ten years of implementation and experience. In Proc. SOSP, pages 103–116, New York, NY, USA, 2009. ACM.
[31] P. Godefroid. Compositional dynamic test generation. In Proc. POPL, pages 47–54, 2007.
[32] H. Jaygarl, S. Kim, T. Xie, and C. K. Chang. OCAT: Object Capture-based Automated Testing. In Proc. ISSTA, July 2010.
[33] W. Jin and A. Orso. BugRedux: Reproducing field failures for in-house debugging. In Proc. ICSE, June 2012.
[34] A. Kiezun, V. Ganesh, P. J. Guo, P. Hooimeijer, and M. D. Ernst. HAMPI: a solver for string constraints. In Proc. ISSTA, pages 105–116, New York, NY, USA, 2009. ACM.
[35] A. Leitner, I. Ciupa, M. Oriol, B. Meyer, and A. Fiva. Contract driven development = test driven development - writing test cases. In Proc. FSE, pages 425–434, New York, NY, USA, 2007. ACM.
[36] A. Leitner, A. Pretschner, S. Mori, B. Meyer, and M. Oriol. On the effectiveness of test extraction without overhead. In Proc. ICST, pages 416–425, Washington, DC, USA, 2009. IEEE Computer Society.
[37] Q. Luo, S. Zhang, J. Zhao, and M. Hu. A lightweight and portable approach to making concurrent failures reproducible. In Proc. FASE, pages 323–337, Berlin, Heidelberg, 2010. Springer-Verlag.
[38] R. Manevich, M. Sridharan, S. Adams, M. Das, and Z. Yang. PSE: explaining program failures via postmortem static analysis. In Proc. FSE, pages 63–72, New York, NY, USA, 2004. ACM.
[39] J. McCarthy. A basis for a mathematical theory of computation. Technical report, MIT, Cambridge, MA, USA, 1962.
[40] J. McCarthy. Towards a mathematical science of computation. In Information Processing 62: Proceedings of IFIP Congress 1962, pages 21–28, Amsterdam, 1963. North-Holland.


[41] J. Nam and N. Chen. Mining crash fix patterns. Technical report, The Hong Kong University of Science and Technology, Hong Kong, 2010. http://www.cse.ust.hk/~ning/star/mining-fix-patterns.pdf.
[42] M. G. Nanda and S. Sinha. Accurate interprocedural null-dereference analysis for Java. In Proc. ICSE, pages 133–143, Washington, DC, USA, 2009. IEEE Computer Society.
[43] S. Narayanasamy, G. Pokam, and B. Calder. BugNet: Continuously recording program execution for deterministic replay debugging. SIGARCH Comput. Archit. News, 33(2):284–295, 2005.
[44] F. Nielson, H. R. Nielson, and C. Hankin. Principles of Program Analysis. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1999.
[45] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball. Feedback-directed random test generation. In Proc. ICSE, pages 75–84, Minneapolis, MN, USA, May 23–25, 2007.
[46] C.-S. Park and K. Sen. Randomized active atomicity violation detection in concurrent programs. In Proc. FSE, pages 135–145, New York, NY, USA, 2008. ACM.
[47] S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. H. Lee, and S. Lu. PRES: probabilistic replay with execution sketching on multiprocessors. In Proc. SOSP, pages 177–192, New York, NY, USA, 2009. ACM.
[48] C. S. Pasareanu, N. Rungta, and W. Visser. Symbolic execution with mixed concrete-symbolic solving. In Proc. ISSTA, pages 34–44, New York, NY, USA, 2011. ACM.
[49] J. Rossler, A. Zeller, G. Fraser, C. Zamfir, and G. Candea. Reconstructing core dumps. In Proc. ICST, pages 114–123, 2013.
[50] D. Saff, S. Artzi, J. H. Perkins, and M. D. Ernst. Automatic test factoring for Java. In Proc. ASE, pages 114–123, November 2005.
[51] N. Serrano and I. Ciordia. Bugzilla, ITracker, and other bug trackers. IEEE Software, 22:11–13, 2005.
[52] J. Steven, P. Chandra, B. Fleck, and A. Podgurski. jRapture: A capture/replay tool for observation-based testing. In Proc. ISSTA, pages 158–167, 2000.
[53] K. Taneja, T. Xie, N. Tillmann, and J. de Halleux. eXpress: guided path exploration for efficient regression test generation. In Proc. ISSTA, pages 1–11, New York, NY, USA, 2011. ACM.
[54] S. Thummalapenta, T. Xie, N. Tillmann, J. de Halleux, and W. Schulte. MSeqGen: Object-oriented unit-test generation via mining source code. In Proc. ESEC/FSE, pages 193–202, New York, NY, USA, 2009. ACM.
[55] S. Thummalapenta, T. Xie, N. Tillmann, J. de Halleux, and Z. Su. Synthesizing method sequences for high-coverage testing. In Proc. OOPSLA, pages 189–206, New York, NY, USA, 2011. ACM.
[56] N. Tillmann and J. de Halleux. Pex: white box test generation for .NET. In Proc. TAP, pages 134–153, 2008.
[57] W. Visser, J. Geldenhuys, and M. B. Dwyer. Green: reducing, reusing and recycling constraints in program analysis. In Proc. FSE, pages 58:1–58:11, New York, NY, USA, 2012. ACM.
[58] D. Weeratunge, X. Zhang, and S. Jagannathan. Analyzing multicore dumps to facilitate concurrency bug reproduction. In Proc. ASPLOS, pages 155–166, New York, NY, USA, 2010. ACM.
[59] M. Weiser. Program slicing. IEEE Transactions on Software Engineering, SE-10(4):352–357, 1984.
[60] X. Xiao, T. Xie, N. Tillmann, and J. de Halleux. Precise identification of problems for structural test generation. In Proc. ICSE, pages 611–620, New York, NY, USA, 2011. ACM.
[61] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: error diagnosis by connecting clues from run-time logs. In Proc. ASPLOS, pages 143–154, New York, NY, USA, 2010. ACM.
[62] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. In Proc. ASPLOS, pages 3–14, New York, NY, USA, 2011. ACM.
[63] C. Zamfir and G. Candea. Execution synthesis: a technique for automated software debugging. In Proc. EuroSys, pages 321–334, New York, NY, USA, 2010. ACM.
[64] Y. Zheng, X. Zhang, and V. Ganesh. Z3-str: A Z3-based string solver for web application analysis. In Proc. FSE, pages 114–124, New York, NY, USA, 2013. ACM.