
Automatically Generated Patches as Debugging Aids: A Human Study

Yida Tao 1, Jindae Kim 1, Sunghun Kim *1, and Chang Xu *2

1 Dept. of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
2 State Key Lab for Novel Software Technology and Dept. of Computer Sci. and Tech., Nanjing University, Nanjing, China

{idagoo,jdkim,hunkim}@cse.ust.hk, [email protected]

ABSTRACT
Recent research has made significant progress in automatic patch generation, an approach to repairing programs with little or no manual intervention. However, direct deployment of auto-generated patches remains difficult, for reasons such as patch quality variations and developers' intrinsic resistance.

In this study, we take one step back and investigate a more feasible application scenario of automatic patch generation, that is, using generated patches as debugging aids. We recruited 95 participants for a controlled experiment in which they performed debugging tasks with the aid of either buggy locations (i.e., the control group) or generated patches of varied quality. We observe that: a) high-quality patches significantly improve debugging correctness; b) such improvements are more pronounced for difficult bugs; c) when using low-quality patches, participants' debugging correctness drops to an even lower point than that of the control group; d) debugging time is significantly affected not by debugging aids, but by participant type and the specific bug to fix. These results highlight that the benefits of using generated patches as debugging aids are contingent upon the quality of the patches. Our qualitative analysis of participants' feedback further sheds light on how generated patches can be improved and better utilized as debugging aids.

Categories and Subject Descriptors
D.2.5 [Testing and Debugging]: Debugging aids

General Terms
Experimentation, Human Factors

Keywords
Debugging, automatic patch generation, human study

* Corresponding authors

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FSE '14, November 16-22, 2014, Hong Kong, China
Copyright 2014 ACM 978-1-4503-3056-5/14/11 ...$15.00.

1. INTRODUCTION
Debugging is of paramount importance in the process of software development and maintenance, given the tens or even hundreds of bugs reported daily for individual projects [2]. However, debugging is dauntingly challenging, as bugs can take days or even months to resolve [2][3].

To alleviate developers' burden of fixing bugs, automatic patch generation has been proposed to repair programs with little or no manual intervention. Recent research in this area has made significant progress [6][17][21][22][26][29][33]. Le Goues et al. proposed a technique based on evolutionary computation and used it to successfully repair 55 out of 105 real-world bugs [21]. Other advanced techniques, such as runtime modifications [6][29], program synthesis using code contracts and formal specifications [26][33], and pattern adaptation [17], have also produced promising results in terms of repair success rate.

Despite these fruitful research outcomes, direct deployment of auto-generated patches in production code seems unrealistic at this point. Kim et al. suggested that generated patches are sometimes nonsensical and thus less likely to be accepted by developers [17]. Research has also identified other reasons why developers are not comfortable with blindly trusting auto-generated code; for example, generated code might be less readable and maintainable [7][13].

Rather than direct deployment, a more feasible application of auto-generated patches would be to aid debugging, since they not only pinpoint buggy locations but also suggest candidate fixes. Developers can still judge whether the candidate fixes are correct and whether to adapt or discard the generated patches.

A natural question thus arises from the above scenario: are auto-generated patches useful for debugging? In addition, generated patches vary in quality, which affects both the functional correctness and the future maintainability of the program being patched [13]. Research has also found that the quality of automated diagnostic aids affects users' reliance on and usage of them [23]. As such, does a patch's quality affect its usefulness as a debugging aid?

In this paper, we conduct a large-scale human study to address these two questions. We recruited 95 participants, including 44 graduate students, 28 software engineers, and 23 Amazon Mechanical Turk (MTurk) [1] users, to individually fix five real bugs. Each bug was aided by one of three hints: its low-quality or high-quality generated patch, or its buggy location (method name) as the control condition. In total, we collected 337 patches submitted by the participants.

Figure 1: Overview of the experimental process. We create debugging tasks using five real bugs. Each bug is aided by one of three hints: its generated patch of low or high quality, or its buggy location. Our participants include computer science graduate students, software engineers, and Amazon Mechanical Turk (MTurk) users. Graduate students debug using the Eclipse IDE in onsite sessions, while software engineers and MTurk participants debug using our web-based debugging system. We evaluate participants' patch correctness, debugging time, and exit-survey feedback.

We observe a significant increase in debugging correctness when participants were aided by high-quality generated patches, and this improvement was more pronounced for difficult bugs. Nevertheless, participants were adversely influenced by low-quality generated patches, as their debugging correctness dropped, surprisingly, to an even lower point than that of the control group. This finding urges strict quality control for generated patches if we are to use them as debugging aids. Otherwise, incorrect or unmaintainable patches can cloud developers' judgment and deteriorate their debugging correctness.

Interestingly, the type of debugging aid has only a marginal influence on participants' debugging time, whereas the type of bug and the participant population have greater influence. For example, we observe a significant slow-down in debugging time for the most difficult bug, and a significant speed-up for software engineers and MTurk participants compared to graduate students.

We also qualitatively analyze participants' attitudes toward using auto-generated patches. Participants gave generated patches credit for quick problem identification and simplification. However, they remained doubtful about whether generated patches can fix complicated bugs and address root causes. Participants also claimed that they could have been more confident and fluent in using generated patches if they had a deeper understanding of the buggy code, its context, and its test cases.

Overall, this paper makes the following contributions:

• A large-scale human study assessing the usefulness of auto-generated patches as debugging aids.

• An in-depth quantitative evaluation of how debugging is influenced by the quality of generated patches and by other factors, such as the types of bugs and participants.

• A qualitative analysis of patch usage, which sheds light on potential directions for automatic patch generation research.

The remainder of this paper is organized as follows. Section 2 introduces our experimental process. Sections 3 and 4 report and discuss the study results. Section 5 presents threats to validity, followed by a survey of related work in Section 6. Section 7 concludes the paper.

2. EXPERIMENTAL PROCESS
This section presents the detailed experimental process, following the overview in Figure 1.

2.1 Bug Selection and Debugging Aids
To evaluate the usefulness of auto-generated patches in debugging, we select bugs that satisfy the following criteria:

• To simulate real debugging scenarios, real-world bugs are favored over seeded ones [14], and diversity of bug types is preferred.

• Patches of varied quality can be automatically generated for the chosen bugs.

• Patches written and accepted by the original developers should be available as the ground truth for evaluating participants' patches.

• A proper number of bugs should be selected to control the length of the human study [16].

Accordingly, we selected five bugs reported by Kim et al. [17] and summarize them in Table 1. Four of the bugs are from Mozilla Rhino (http://www.mozilla.org/rhino) and one is from Apache Commons Math (http://commons.apache.org/proper/commons-math). These bugs manifest different symptoms and cover various defect types, and all of them have been fixed by developer-written and verified patches.

Kim et al. reported that all five bugs can be fixed by two state-of-the-art program repair techniques, GenProg [21] and PAR [17], such that their generated patches pass all the corresponding test cases [17]. Kim et al. also launched a survey in which 85 developers and CS students were asked to rank the acceptability of the generated patches for each bug [17]. Table 2 reports these ten generated patches (two patches for each bug) along with their average acceptability rankings reported by Kim et al. [17].

While the generated patches are equally qualified in terms of passing test cases, humans rank patch acceptability by further judging their fix semantics and understandability, which are the primary concerns of patch quality [13][15]. Hence, we inherited the acceptability ranking reported by Kim et al. [17] as an indicator of patch quality. Specifically, for the two patches generated for each bug, we labeled the one with the higher ranking as high-quality and the one with the lower ranking as low-quality.

Since generated patches already reveal buggy locations and suggest candidate fixes, for a fair comparison we provide the names of the buggy methods to the control group instead of leaving it completely unaided. This design conforms to real debugging scenarios, where developers are aware of basic faulty regions, for example from detailed bug reports [14].

Table 1: Description of the five bugs used in the study. We list the semantics of the original developers' fixes, which are used to evaluate the correctness of participants' patches.

Math-280
  Buggy code fragment and symptom:
    1  public static double[] bracket(
    2      double lowerBound, double upperBound) {
    3    if (fa * fb >= 0.0) {
    4      throw new ConvergenceException(
    5      ...
    ConvergenceException is thrown at line 4.
  Semantics of original developers' patch:
    • If fa * fb == 0.0, then the method bracket() should terminate without ConvergenceException, regardless of the given lowerBound and upperBound values.
    • If fa * fb > 0.0, then the method bracket() should throw ConvergenceException.

Rhino-114493
  Buggy code fragment and symptom:
    1  if (lhs == undefined) {
    2    lhs = strings[getShort(iCode, pc + 1)];
    3  }
    ArrayIndexOutOfBoundsException is thrown at line 2.
  Semantics of original developers' patch:
    • If getShort(iCode, pc + 1) returns a valid index, then strings[getShort(iCode, pc + 1)] should be assigned to lhs.
    • If getShort(iCode, pc + 1) returns -1, the program should return normally without AIOBE, and lhs should be undefined.

Rhino-192226
  Buggy code fragment and symptom:
    1  private void visitRegularCall(Node node, int
    2      type, Node child, boolean firstArgDone) {
    3    ...
    4    String simpleCallName = null;
    5    if (type != TokenStream.NEW) {
    6      simpleCallName = getSimpleCallName(node);
    7    }
    8    child = child.getNext().getNext();
    NullPointerException is thrown at line 8.
  Semantics of original developers' patch:
    • Statements in the block of the if statement are executed only if firstArgDone is false.
    • simpleCallName should be initialized with the return value of getSimpleCallName(node).

Rhino-217379
  Buggy code fragment and symptom:
    1  for (int i = 0; i < parenCount; i++) {
    2    SubString sub = (SubString) parens.get(i);
    3    args[i+1] = sub.toString();
    4  }
    NullPointerException is thrown at line 3.
  Semantics of original developers' patch:
    • If parens.get(i) is not null, assign it to args[i+1] for all i = 0..parenCount-1.
    • The program should terminate normally without NullPointerException when sub is null.

Rhino-76683
  Buggy code fragment and symptom:
    1  for (int i = num; i < state.parenCount; i++)
    2    state.parens[i].length = 0;
    3  state.parenCount = num;
    NullPointerException is thrown at line 2.
  Semantics of original developers' patch:
    • The program should terminate without NullPointerException even if state.parens[i] is equal to null.
    • If the for statement is processing the array state.parens, it should be processed from index num to state.parenCount-1.

For simplicity, we refer to participants given different aids as the LowQ (low-quality patch aided), HighQ (high-quality patch aided), and Location (buggy location aided) groups.

2.2 Participants
Grad: We recruited 44 CS graduate students: 12 from The Hong Kong University of Science and Technology and 32 from Nanjing University. Graduate students typically have some programming experience, but in general they are still in an active learning phase. In this sense, they resemble novice developers.

Engr: We also invited industrial software engineers to represent the population of experienced developers. In this study, we recruited 28 software engineers from 11 companies via email invitations.

MTurk: We further recruited participants from Amazon Mechanical Turk, a crowdsourcing website where requesters post tasks and workers earn money by completing them [1]. The MTurk platform potentially widens the variety of our participants, since its workers have diverse demographics such as nationality, age, gender, education, and income [32]. To safeguard worker quality, we prepared a buggy code snippet revised from Apache Commons Collections Issue 359 and asked workers to describe how to fix it. Only those who passed this qualifying test could proceed to our debugging tasks. We finally recruited 23 MTurk workers, including 20 developers, two undergraduate students, and one IT manager, with 1-14 (on average 5.7) years of Java programming experience.

2.3 Study Design
We provided participants with detailed bug descriptions from the Mozilla and Apache bug reports as their starting point. We also provided developer-written test cases, which included the failing ones that reproduced the bugs. Although participants were encouraged to fix as many bugs as they could, they were free to skip any bug, as completing all five tasks was not mandatory. To accommodate participants' different demographics, we adopted the onsite and online settings introduced below.

Onsite setting applies to the graduate students. We adopted a between-group design [20] by dividing them into three groups; each group was exposed to one of the three debugging aids. Prior to the study, we asked students to self-report their Java programming experience and familiarity with Eclipse on a 5-point Likert scale. We used this information to evenly divide them into three groups with members having similar levels of expertise. Finally, the LowQ and

Table 2: The two generated patches for each bug. The middle column of the original table shows the patches' acceptability ranking reported by Kim et al. [17]; the patch with the higher ranking (lower bar) is labeled as high-quality, and its counterpart for the same bug is labeled as low-quality. (The ranking bar charts are not reproduced here.)

Math-280
  High-quality patch:
    public static double[] bracket(
        double lowerBound, double upperBound) {
      if (fa * fb > 0.0 || fa * fb >= 0.0) {
        throw new ConvergenceException(
  Low-quality patch:
    public static double[] bracket(
        double lowerBound, double upperBound) {
      if (function == null) {
        throw MathRuntimeException.createIllegalArgumentException(

Rhino-114493
  High-quality patch:
    if (lhs == undefined) {
      if (getShort(iCode, pc + 1) < strings.length &&
          getShort(iCode, pc + 1) >= 0) {
        lhs = strings[getShort(iCode, pc + 1)];
      }
    }
  Low-quality patch:
    if (lhs == undefined) {
      lhs = ((Scriptable)lhs).getDefaultValue(null);
    }

Rhino-192226
  High-quality patch:
    private void visitRegularCall(Node node, int
        type, Node child, boolean firstArgDone) {
      String simpleCallName = null;
      if (type != TokenStream.NEW) {
        if (!firstArgDone) {
          simpleCallName = getSimpleCallName(node);
        }
      }
  Low-quality patch:
    private void visitRegularCall(Node node, int
        type, Node child, boolean firstArgDone) {
      String simpleCallName = null;
      visitStatement(node);

Rhino-217379
  High-quality patch:
    for (int i = 0; i < parenCount; i++) {
      SubString sub = (SubString) parens.get(i);
      if (sub != null) {
        args[i+1] = sub.toString();
      }
    }
  Low-quality patch:
    for (int i = 0; i < parenCount; i++) {
      SubString sub = (SubString) parens.get(i);
      args[parenCount + 1] = new Integer(reImpl.leftContext.length);
    }

Rhino-76683
  High-quality patch:
    for (int i = num; i < state.parenCount; i++) {
      if (state != null && state.parens[i] != null) {
        state.parens[i].length = 0;
      }
    }
    state.parenCount = num;
  Low-quality patch:
    for (int i = num; i < state.parenCount; i++) {
    }
    state.parenCount = num;

HighQ groups each had 15 participants, and the Location group had 14 participants.

The graduate students used the Eclipse IDE (Figure 2a). We piloted the study with 9 CS undergraduate students and recorded their entire Eclipse usage with screen-capturing software. From this pilot study, we determined that two hours should be adequate for participants to complete all five tasks at a reasonably comfortable pace. The formal study session was supervised by one of the authors and two other helpers. At the beginning of the session, we gave a 10-minute tutorial introducing the tasks. Then, within a maximum of two hours, participants completed the five debugging tasks in their preferred order.

Online setting applies to the software engineers and MTurk participants, who were geographically unavailable. We created a web-based online debugging system (http://pishon.cse.ust.hk/userstudy) that provided features similar to those of the Eclipse workbench (Figure 2b). Unlike with Grad, we could not determine beforehand the total number of these online participants or their expertise. Hence, to minimize individual differences, we adopted a within-group design such that each participant could be exposed to different debugging aids [20]. To balance the experimental conditions, we assigned the type of aid to each selected bug in a round-robin fashion, such that each aid was equally likely to be given for each bug.
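To make the counterbalancing scheme concrete, the following is a minimal sketch of such a round-robin assignment; the function and variable names are hypothetical and do not come from the study's actual tooling.

# Hypothetical sketch: rotate the three aids across the five bugs by participant
# index, so that over every three consecutive participants each bug is paired
# with each aid exactly once.
BUGS = ["Math-280", "Rhino-114493", "Rhino-192226", "Rhino-217379", "Rhino-76683"]
AIDS = ["Location", "LowQ", "HighQ"]

def assign_aids(participant_index):
    """Return a bug -> aid mapping for one online participant (illustrative only)."""
    return {bug: AIDS[(participant_index + i) % len(AIDS)]
            for i, bug in enumerate(BUGS)}

for p in range(3):          # example: three participants cover every (bug, aid) pair
    print(p, assign_aids(p))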

We piloted the study with 4 CS graduate students. They were asked to use our debugging system and to report any problems they encountered. We resolved these problems, such as browser incompatibility, before formally inviting software engineers and MTurk users.

An exit survey was administered to all participants upon task completion. In the survey, we asked the participants to rate the difficulty of each bug and the helpfulness of the provided aids on a 5-point Likert scale. We also asked them to share their opinions on using auto-generated patches in free-text form. We asked Engr and MTurk participants to self-report their Java programming experience and debugging time for each task, since the online system cannot monitor when they are away from the keyboard (discussed in Section 5). In addition, we asked MTurk participants to report their occupations.

We summarize our experimental settings in Table 3.

(a) The Eclipse workbench    (b) The online debugging system

Figure 2: The onsite and online debugging environments, which allow participants to browse a bug's description (A), edit the buggy source code (B), view the debugging aids (C), run the test cases (D), and submit their patches (E).

2.4 Evaluation Measures
We measure the participants' patch correctness, debugging time, and survey feedback, as described below.

Correctness indicates whether a patch submitted by a participant is correct or not. We considered a patch correct only if it passed all the given test cases and also matched the semantics of the original developers' fixes as described in Table 1. Two of the authors and one CS graduate student with 8 years of Java programming experience individually evaluated all patches. Initially, the three evaluators agreed on 283 of the 337 patches. The disagreement mainly resulted from misunderstandings of patch semantics. After clarification, the three evaluators carried out another round of individual evaluations and this time agreed on 326 of the 337 patches. This was considered near-perfect agreement, as its Fleiss' kappa value equals 0.956 [11]. For the remaining 11 patches, the majority opinion (i.e., that of two of the three evaluators) was adopted as the final evaluation result.
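For reference, this agreement statistic can be computed with standard tooling; the snippet below is a minimal sketch (not the authors' actual analysis script), assuming a 337 x 3 matrix of binary correct/incorrect labels, one column per evaluator.

# Minimal sketch of computing Fleiss' kappa for three raters over 337 patches.
# The ratings matrix is a random placeholder; the real input would be the
# evaluators' correct/incorrect judgments described above.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.random.randint(0, 2, size=(337, 3))  # placeholder 0/1 labels
counts, _ = aggregate_raters(ratings)             # items x categories count table
print(fleiss_kappa(counts))                       # the paper reports 0.956 on the real labels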

Debugging time was recorded automatically. For the onsite session, we created an Eclipse plug-in to sum up the elapsed time of all the activities (e.g., opening a file, modifying the code, and running the tests) related to each bug. For the online sessions, our system computes participants' debugging time as the time elapsed from entering the website until patch submission.

Feedback was collected from the exit survey (Section 2.3). We particularly analyzed participants' perception of the helpfulness of generated patches and their free-form answers. These two items allowed us to quantitatively and qualitatively evaluate participants' opinions on using generated patches as debugging aids (Sections 3.1 and 4).

Table 3: The study settings and designs for the different types of participants, along with their population sizes and numbers of submitted patches. In cases where participants skipped or gave up on some tasks, we excluded patches with no modification to the original buggy code, finally collecting 337 patches for analysis.

         Setting           Design          Size   Patches
Grad     Onsite (Eclipse)  Between-group   44     216
Engr     Online (website)  Within-group    28     68
MTurk    Online (website)  Within-group    23     53
Total                                      95     337

3. RESULTS
We report participants' debugging performance with respect to different debugging aids (Section 3.1). We then investigate debugging performance across participant types (Section 3.2), bug difficulties (Section 3.3), and programming experience (Section 3.4). We use regression models to statistically analyze the relations between these factors and debugging performance (Section 3.5), and finally summarize our observations (Section 3.6).

3.1 Debugging Aids
Among the 337 patches we collected, 109, 112, and 116 were aided by buggy locations, LowQ patches, and HighQ patches, respectively. Figure 3 shows an overview of the results in terms of the three evaluation measures.

Correctness (Figure 3a). Patches submitted by the Location group are 48% (52/109) correct. The percentage of correct patches dramatically increases to 71% (82/116) for the HighQ group. LowQ patches do not improve debugging correctness; on the contrary, the LowQ group performs even worse than the Location group, with only 33% (37/112) of its patches being correct.

Debugging Time (Figure 3b). The HighQ group has an average debugging time of 14.7 minutes, which is slightly faster than the 17.3 and 16.3 minutes of the Location and LowQ groups.

Feedback (Figure 3c). HighQ patches are considered much more helpful for debugging than LowQ patches. A Mann-Whitney U test suggests that this difference is statistically significant (p-value = 0.0006 < 0.05). To understand the reasons behind this result, we further analyze participants' textual answers and discuss the details in Section 4.
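For readers who wish to replicate this comparison on their own data, a minimal sketch is shown below; the rating lists are placeholders, not the study's actual responses.

# Minimal sketch: two-sided Mann-Whitney U test on 1-5 Likert helpfulness ratings.
from scipy.stats import mannwhitneyu

lowq_ratings = [2, 3, 1, 2, 4, 2, 3]    # placeholder ratings for LowQ patches
highq_ratings = [4, 5, 4, 3, 5, 4, 4]   # placeholder ratings for HighQ patches

stat, p = mannwhitneyu(lowq_ratings, highq_ratings, alternative="two-sided")
print(stat, p)   # the paper reports p = 0.0006 on the real ratings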

3.2 Participant Types
Our three types of participants, Engr, Grad, and MTurk, have different demographics and are exposed to different experimental settings. For this reason, we investigate debugging performance by participant type.

Figure 5a shows a consistent trend across all three participant types: the HighQ group makes the highest percentage of correct patches, followed by the Location group, while the LowQ group makes the lowest. Note that Grad has lower correctness than Engr and MTurk, possibly due to their relatively short programming experience.

Figure 5b reveals that Engr debugs faster than Grad and MTurk participants. Yet, among participants from the same population, debugging time does not vary much between the Location, LowQ, and HighQ groups.

Figure 3: (a) The percentage of correct patches by different debugging aids; (b) debugging time by different aids; (c) participants' perception of the helpfulness of LowQ and HighQ patches.

3.3 Bug Difficulty
Since the five bugs vary in symptoms and defect types, they possibly pose different levels of challenge to the participants. Research also shows that task difficulty potentially correlates with users' perception of automated diagnostic aids [23]. For this reason, we investigate participants' debugging performance with respect to different bugs. We obtained the difficulty ratings of the bugs from the exit survey. Note that we adopt the difficulty ratings from only the Location group, since other participants' perceptions could have been influenced by the presence of generated patches. As shown in Figure 4, Rhino-192226 is considered the most difficult bug, while Rhino-217379 is considered the easiest. In particular, ANOVA with a Tukey HSD post-hoc test [20] shows that the difficulty rating of Rhino-192226 is significantly higher than those of the remaining four bugs. (For simplicity, in the remainder of the paper we refer to each bug by its order of appearance in Table 1.)

Figure 5c shows that for bug3, the HighQ group scored a "landslide victory," with 70% of the submitted patches being correct, while no one from the Location or the LowQ group was able to fix it correctly. One primary reason could be the complex nature of this bug, which is caused by a missing if construct (Table 2). This type of bug, also known as a code omission fault, can be hard to identify [10][14]. Fry and Weimer also empirically showed that the accuracy of human fault localization is relatively low for missing-conditional faults [14]. In addition, the fix location of this bug is not the statement that throws the exception, but five lines above it. The code itself is also complex, as it involves lexical parsing and dynamic compilation. Hence, both the Location and LowQ groups stumbled on this bug. However, the HighQ group indeed benefited from the generated patch, as one participant explained:

"Without deep knowledge of the data structure representing the code being optimized, the problem would have taken quite a long time to determine by manually. The patch precisely points to the problem and provided a good hint, and my debugging time was cut down to about 20 minutes."

For bug4, interestingly, the Location group performs better than the LowQ and even the HighQ group (Figure 5c). One possible explanation is that the code of this bug requires less project-specific knowledge (Table 2) and its fix

Figure 4: The average difficulty rating for each bug, from 1 (very easy) to 5 (very difficult): Bug1 (Math-280), Bug2 (Rhino-114493), Bug3 (Rhino-192226), Bug4 (Rhino-217379), and Bug5 (Rhino-76683).

by adding a null checker is a common pattern [17]. Therefore, even without the aid of generated patches, participants were likely to resolve this bug solely by themselves, and they indeed considered it to be the easiest bug (Figure 4). For bug5, the LowQ patch deletes the entire buggy block inside the for-loop (Table 2), which might raise suspicion for being too straightforward. This possibly explains why the LowQ and Location groups have the same correctness for this bug.

In general, the HighQ group made the highest percentage of correct patches for all bugs except bug4. The LowQ group, on the other hand, has comparable or even lower correctness than the Location group. As for debugging time, participants are in general slower for the first three bugs, as shown in Figure 5d. We speculate that this also relates to the difficulty of the bugs, as these three bugs are considered more difficult than the other two (Figure 4).

3.4 Programming Experience
Research has found that expert and novice developers differ in their debugging strategies and performance [18]. In this study, we also explore whether the usefulness of generated patches as debugging aids is affected by participants' programming experience.

Among all 95 participants, 72 reported their Java programming experience. They have up to 14 and on average 4.4 years of Java programming experience. Accordingly, we divided these 72 participants into two groups: experts with above-average and novices with below-average programming experience.

As shown in Figure 5e, experts are 60% correct when aided by Location. This number increases to 76% when they are aided by HighQ patches and decreases to 45% when they

Figure 5: Percentage of correct patches and debugging time by participant type, bug, and programming experience.

are aided by LowQ patches. Novices manifest a similar yet sharper trend: their percentage of correct patches increases substantially from 44% to 74% when they are aided by Location and HighQ patches, respectively. Interestingly, when aided by HighQ patches, the debugging correctness of novices is almost as good as that of the experts. Figure 5f shows that novices also spend less time debugging with HighQ patches than with Location and LowQ patches.

3.5 Regression Analysis
The preceding subsections observe that changes in debugging correctness and time are attributable to debugging aids, bugs, participant types, programming experience, and their joint effects. To quantify the effects of these factors on debugging performance, we perform multiple regression analysis. Multiple regression analysis is a statistical process for quantifying the relations between a dependent variable and a number of independent variables [12], which fits our problem naturally. It also quantifies the statistical significance of the estimated relations [12]. This type of analysis has been applied in other similar research, for example, to analyze how configurational characteristics of software teams affect project productivity [4][5][31].

Debugging aids, participant types, and bugs are all categorical variables, with 3, 3, and 5 different values, respectively. Since categorical variables cannot be used directly in regression models, we apply dummy coding to transform a categorical variable with k values into k-1 binary variables [12]. For example, for the debugging aids we create two binary variables, HighQ and LowQ, one of them being 1 if the corresponding patch is given. When both variables are 0, the Location aid is given, and it serves as the basis to which the other two categories are compared. In the same manner, we dummy-code participant types and bugs, with "Grad" and "bug4" as the basis categories (the selection of the basis category is essentially arbitrary and does not affect the regression results [12]). The equations below form our regression models. Since patch correctness is binary data while debugging time is continuous data, we estimate Equation (1) using logistic regression and Equation (2) using linear regression [12]. Note that we log-transform debugging time to ensure its normality [12].

logit(correctness) = α0 + α1*HighQ + α2*LowQ + α3*Engr + α4*MTurk + α5*Bug1 + α6*Bug2 + α7*Bug3 + α8*Bug5 + α9*JavaYears    (1)

ln(time) = β0 + β1*HighQ + β2*LowQ + β3*Engr + β4*MTurk + β5*Bug1 + β6*Bug2 + β7*Bug3 + β8*Bug5 + β9*JavaYears    (2)
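As an illustration only (hypothetical data-frame and column names; not the authors' analysis script), both equations can be estimated with a standard statistics package whose formula interface performs the dummy coding described above, with Location, Grad, and bug4 as the basis categories:

# Illustrative sketch of Equations (1) and (2) with statsmodels' formula API.
# df is assumed to hold one row per submitted patch, with columns:
#   correct (0/1), minutes (debugging time), aid, participant, bug, java_years.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("patches.csv")   # hypothetical input file

predictors = ("C(aid, Treatment('Location')) + C(participant, Treatment('Grad'))"
              " + C(bug, Treatment('bug4')) + java_years")

# Equation (1): logistic regression on patch correctness.
logit_fit = smf.logit("correct ~ " + predictors, data=df).fit()

# Equation (2): linear regression on log-transformed debugging time.
ols_fit = smf.ols("np.log(minutes) ~ " + predictors, data=df).fit()

print(logit_fit.summary())
print(ols_fit.summary())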

Table 4 shows the regression results, with the first number being the variable coefficient and the second number, in parentheses, its p-value. In general, a positive coefficient means the variable (row name) improves the debugging performance (column name) compared to the basis category, while a negative coefficient means the opposite. Since the logistic regression computes the log-odds of correctness and we log-transform debugging time in the linear regression, both computed coefficients need to be "anti-logged" to interpret the results. We highlight our findings below.

HighQ patches significantly improve debugging correctness, with coef = 1.25 > 0 and p-value = 0.00 < 0.05. To put this into perspective, holding other variables fixed, the odds of the HighQ group making correct patches are e^1.25 = 3.5 times those of the Location group. The LowQ group, on the other hand, is less likely to make correct patches, though this trend is not significant (coef = -0.55, p = 0.09). The aid of HighQ (coef = -0.19, p = 0.08) and LowQ patches (coef = -0.07, p = 0.56) both slightly reduce debugging time, but not significantly.
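Restating this "anti-logging" step with the symbols of Equations (1) and (2) and the estimates reported in Table 4:

e^α1 = e^1.25 ≈ 3.5   (odds multiplier for correctness, HighQ vs. Location)
e^β7 = e^0.59 ≈ 1.8   (multiplier on debugging time, Bug3 vs. bug4)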

Difficult bugs significantly slow down debugging, as bug1, bug2, and bug3 all have significant positive coefficients, i.e., longer debugging time. For the most difficult bug, bug3, its debugging

Table 4: Regression results. Arrows indicate the direction of impact on debugging performance. Double arrows indicate that the coefficient is statistically significant at the 5% level.

              Correctness        Debugging Time
              coef (p-value)     coef (p-value)
HighQ          1.25 (0.00) ⇑     -0.19 (0.08) ↓
LowQ          -0.55 (0.09) ↓     -0.07 (0.56) ↓
Bug1          -0.57 (0.18) ↓      0.96 (0.00) ⇑
Bug2          -0.50 (0.22) ↓      0.56 (0.00) ⇑
Bug3          -2.09 (0.00) ⇓      0.59 (0.00) ⇑
Bug5          -0.59 (0.16) ↓     -0.05 (0.75) ↓
Engr           1.03 (0.02) ⇑     -1.25 (0.00) ⇓
MTurk          0.90 (0.02) ⇑     -0.16 (0.20) ↓
Java years     0.12 (0.02) ⇑     -0.03 (0.11) ↓

time is e^0.59 = 1.8 times that of bug4, yet the odds of correctly fixing it are only e^-2.09 ≈ 1/8 of the odds of correctly fixing bug4.

Participant type and experience significantly affect debugging performance: compared to Grad, Engr and MTurk are much more likely to make correct patches (coef = 1.03 and 0.90, with p = 0.02). Engr is also significantly faster, spending approximately e^-1.25 ≈ 1/3 of the time debugging. Also, more Java programming experience significantly improves debugging correctness (coef = 0.12, p = 0.02).

Our regression analysis thus far investigates the individual impact of debugging aids and other variables on debugging performance. However, the impact of debugging aids also possibly depends on the participants who use the aids or on the bugs to be fixed. To explore such joint effects, we add interaction variables to the regression model and use ANOVA to analyze their explanatory power [12]. Below are our findings.

HighQ patches are more useful for difficult bugs: as shown in Table 5, in addition to the four single variables, the debugging aids and bugs also jointly explain a large variance in patch correctness (deviance reduction = 25.54, p = 0.00). Hence, we add this interaction variable, Aid*Bug, to the logistic regression for estimating correctness. Table 6 shows that HighQ*Bug3 has a relatively large positive coefficient (coef = 18.69), indicating that the HighQ patch is particularly useful for fixing this bug.

The type of aid does not affect debugging time: none of the interaction variables contributes significantly to the model (Table 5). Interestingly, neither does the debugging aid itself, which explains only 1% (R² = 0.01) of the variance in debugging time. Instead, debugging time is much more susceptible to participant types and bugs, which further explain 20% and 16% of its variance, respectively.

These regression results are consistent with those observed in Figures 3, 4, and 5 in the preceding subsections. Due to space limitations, we elaborate on more details of the regression analysis at http://www.cse.ust.hk/~idagoo/autofix/regressionanalysis.html, which reports the descriptive statistics and correlation matrix for the independent variables, the hierarchical regression results, and how Tables 4, 5, and 6 are derived.

3.6 Summary
Based on our results, we draw the following conclusions:

• High-quality generated patches significantly improve debugging correctness and are particularly beneficial

Table 5: ANOVA for the regression models with interaction variables. It shows how much variance in correctness/debugging time is explained as each (interaction) variable is added to the model. The p-value in parentheses indicates whether adding the variable significantly improves the model.

                   Correctness          Debugging Time
Variable           (resid. deviance     R² (variance
                   reduction)           explained)
DebuggingAid       32.80 (0.00)         0.01 (0.06)
Participant        16.85 (0.00)         0.20 (0.00)
Bug                21.91 (0.00)         0.16 (0.00)
JavaYears           6.13 (0.01)         0.01 (0.11)
Aid*Participant     2.95 (0.57)         0.02 (0.06)
Aid*Bug            25.54 (0.00)         0.02 (0.26)
Aid*JavaYears       1.18 (0.55)         0.00 (0.82)

Table 6: Coefficients of the interaction variable DebugAid*Bug for estimating patch correctness. The bold value is statistically significant at the 5% level.

         Bug1     Bug2     Bug3     Bug4     Bug5
HighQ    0.42     1.86     18.69   -0.45     0.99
LowQ    -0.39    -0.24     0.12    -1.76    -0.09

for difficult bugs. Low-quality generated patches slightly undermine debugging correctness.

• Participants' debugging time is not affected by the debugging aids they use. However, their debugging takes significantly longer for difficult bugs.

• Participant type and experience also significantly affect debugging performance. Engr, MTurk, and participants with more Java programming experience are more likely to make correct patches.

4. QUALITATIVE ANALYSIS
We used the survey feedback to qualitatively investigate participants' opinions on using auto-generated patches in debugging. In total, we received 71 textual answers. After an open coding phase [20], we identified twelve common reasons why participants are positive or negative about using auto-generated patches as debugging aids. Table 7 summarizes these reasons along with the participants' original answers.

Participants from both the HighQ and LowQ groups acknowledge that generated patches provide quick starting points by pinpointing the general buggy area (P1-P3). HighQ patches in particular simplify debugging and speed up the debugging process (P4-P6). However, a quick starting point does not guarantee that the debugging goes down the right path, since generated patches can be confusing and misleading (N1-N2). Even HighQ patches may not completely solve the problem and thus need further human refinement (N3).

Participants, regardless of whether they used HighQ or LowQ patches, have a general concern that generated patches may over-complicate debugging or over-simplify it by not addressing root causes (N4-N5). In other words, participants are concerned about whether machines are making random or educated guesses when generating patches (N6). Another shared concern is that the use of generated patches should be based on a good understanding of the target programs (N7-N8).

Table 7: Participants' positive and negative opinions on using auto-generated patches as debugging aids. All of these sentences are the participants' original feedback. We organize them into different categories, with the summary written in italics.

Positive

It provides a quick starting point.
P1: "It usually provides a good starting point." (HighQ)
P2: "It did point to the general area of code." (LowQ)
P3: "I have followed the patch to identify the exact place where the issue occurred." (LowQ)

It simplifies the problem.
P4: "I feel that the hints were quite good and simplified the problem." (HighQ)
P5: "The hint provided the solution to the problem that would have taken quite a long time to determine by manually." (HighQ)
P6: "This saves time when it comes to bugs and memory monitoring." (HighQ)

It helps brainstorming.
P7: "They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn't always correct on its own." (LowQ)

It recognizes easy problems.
P8: "They would be good at recognizing obvious problems (e.g., potential NPE)." (HighQ)

It makes tests pass.
P9: "The patch made the test case pass." (LowQ)

It provides extra diagnostic information.
P10: "Any additional information regarding a bug is almost always helpful." (HighQ)

Negative

It can be confusing, misleading, or incomplete.
N1: "The generated patch was confusing." (LowQ)
N2: "For developers with less experience, it can give them the wrong lead." (LowQ)
N3: "It's not a complete solution. The patch tested for two null references; however, one of the two variables was dereferenced about 1-2 lines above the faulting line. Therefore, there should have been two null reference tests." (HighQ)

It may over-complicate the problem or over-simplify it by not addressing the root cause.
N4: "Sometimes they might over-complicate the problem or not address the root of the issue." (HighQ)
N5: "The patch did not appear to catch the issue and instead looked like it deleted the block of code." (LowQ)
N6: "Generated patches are only useful when the machine does not do it randomly, e.g., it cannot just guess that something may fix it, and if it does that, that is good." (LowQ)

It may not be helpful for unfamiliar code.
N7: "I prefer to fix the bug manually with a deeper understanding of the program." (HighQ)
N8: "A context to the problem may have helped here." (LowQ)

It may not recognize complicated problems.
N9: "They would be difficult to recognize more involved defects." (HighQ)

It may not work if the test suite itself is insufficient.
N10: "Assuming the test cases are sufficient, the patch will work." (HighQ)
N11: "I'm worried that other test cases will fail." (LowQ)

It cannot replace standard debugging tools.
N12: "I would use them as debugging aids, along with access to standard debugging tools." (LowQ)

Interestingly, we also observe several pairs of opposite opinions. Participants think that generated patches may be good at recognizing easy bugs (P8) but may not work well for complicated ones (N9). While generated patches can quickly pass test cases (P9), they work only if the test cases are sufficient (N10-N11). Finally, participants use generated patches to brainstorm (P7) and to acquire additional diagnostic information (P10). However, they still consider generated patches dispensable, especially when powerful debugging tools are available (N12).

In general, participants are positive about auto-generated patches, as they provide quick hints about buggy areas or even fix solutions. On the other hand, participants are less confident about the capability of generated patches to handle complicated bugs and address root causes, especially when they are unfamiliar with the code and test cases.

5. THREATS TO VALIDITY
The five bugs and ten generated patches we used to construct the debugging tasks may not be representative of all bugs and generated patches. We mitigate this threat by selecting real bugs with varied symptoms and defect types and by using patches that were generated by two state-of-the-art program repair techniques [17][34]. Nonetheless, further studies with wider coverage of bugs and auto-generated patches are required to strengthen the generalizability of our findings.

We measured patch quality using human-perceived acceptability, which may not generalize to other measurements, such as metric-based ones [27]. However, patch quality is a rather broad concept, and research has suggested that there is no easy way to definitively describe patch quality [13][24]. Another threat regarding patch quality is that we labeled a generated patch as "LowQ" relative to its competing "HighQ" patch in this study. We plan to work on a more general solution by establishing a quality baseline, for example the buggy location as in the control group, and distinguishing the quality of auto-generated patches according to this baseline.

To avoid repeated testing [20], we did not invite domain experts who were familiar with the buggy code, since it would be hard to judge whether their good performance is attributable to the debugging aids or to their a priori knowledge of the corresponding buggy code. Instead, we recruited participants who had little or no domain knowledge of the code. This is not an unrealistic setting, since in practice developers are often required to work with unfamiliar code due to limited human resources or deadline pressure [9][19][25]. However, our findings may not generalize to project developers who are sufficiently knowledgeable about the code. Future studies are required to explore how developers make use of auto-generated patches to debug code they are familiar with.

Our manual evaluation of the participants' patches might be biased. We mitigate this threat by considering patches verified by the original project developers as the ground truth. Three human evaluators individually evaluated the patches over two iterations and reached a consensus, with a Fleiss' kappa of 0.956 (Section 2.4).

Our measurement of debugging time might be inaccurate for Engr and MTurk participants, since the online system cannot remotely monitor participants when they are away from the keyboard during debugging. To mitigate this threat, we asked these participants to self-report the time they had spent on each task and found no significant difference between their estimated time and our recorded time.

One may argue that participants might have blindly reused the auto-generated patches instead of truly fixing the bugs on their own. We took several preventive measures to discourage such behavior. First, we emphasized in the instructions that the provided patches may or may not really fix the bugs, and that participants should make their own judgment. Second, we claimed to hold back additional test cases of our own, so that participants could not relax even if they passed all the tests by reusing the generated patches. We also required participants to justify their patches during submission. As discussed in Section 3, the participants aided by generated patches spent a similar amount of debugging time to the Location group. This indicates that participants were unlikely to reuse generated patches directly, which would otherwise have taken only seconds to complete.

6. RELATED WORK
Automatic patch generation techniques have been actively studied. Most studies evaluated the quantitative aspects of the proposed techniques, such as the number of candidates generated before a valid patch is found and the number of subjects for which patches can be generated [8][21][26][30][33][34]. Our study complements prior studies by qualitatively evaluating the usefulness of auto-generated patches as debugging aids.

Several studies have discussed the quality of generated patches. Kim et al. evaluated generated patches using human acceptability [17], which is adopted in our study to indicate patch quality. Fry et al. considered patch quality in terms of understandability and maintainability [13]. Their human study revealed that generated patches alone are less maintainable than human-written patches. However, when augmented with synthesized documentation, generated patches have comparable or even better maintainability than human-written patches [13]. Fry et al. also investigated code features (e.g., number of conditionals) that correlate with patch maintainability [13].

While patch quality can be measured from various perspectives, research has suggested that, instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help with difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, while generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behavior. In fact, contrary to the findings of Ceccato et al., we observed that participants' debugging performance dropped when they were given low-quality generated patches.

7. CONCLUSION
We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discover that participants do fix bugs more quickly and correctly using high-quality generated patches. However, when aided by low-quality generated patches, participants' debugging correctness is compromised.

Based on the study results, we have learned important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required even if they are used as debugging aids rather than being integrated directly into production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize the overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs; for easy bugs, however, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from using generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable about the buggy code.

8. ACKNOWLEDGMENTS
We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9. REFERENCES
[1] Amazon Mechanical Turk. https://www.mturk.com/mturk.
[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.
[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE '09.
[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE '09.
[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE '11, 2011.
[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezze. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE '12, pages 452-462.
[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE '09.
[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: An exploratory study. ICSE '12, pages 266-276.
[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. IEEE Transactions on Software Engineering, 32(11):849-867, 2006.
[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[12] J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage, 1997.
[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA '12.
[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM '10.
[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE '10.
[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.
[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC '03.
[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971-987, 2006.
[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.
[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. ICSE '12.
[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1-23, 2013.
[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241-256, 2006.
[24] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE '14.
[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: A controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 12-22, 2006.
[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: Program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks: An empirical study with industry practitioners. In ICSM '11.
[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11.
[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP '09.

[30] Y Qi X Mao Y Lei and C Wang Using automatedprogram repair for evaluating the effectiveness of faultlocalization techniques ISSTA rsquo13 2013

[31] N Ramasubbu M Cataldo R K Balan and J DHerbsleb Configuring global software teams Amulti-company analysis of project productivityquality and profits ICSE rsquo11 pages 261ndash270 2011

[32] J Ross L Irani M S Silberman A Zaldivar andB Tomlinson Who are the crowdworkers shiftingdemographics in mechanical turk CHI EA rsquo10

[33] Y Wei Y Pei C A Furia L S Silva S BuchholzB Meyer and A Zeller Automated fixing ofprograms with contracts ISSTA rsquo10

[34] W Weimer T Nguyen C Le Goues and S ForrestAutomatically finding patches using geneticprogramming ICSE rsquo09

  • Introduction
  • Experimental Process
    • Bug Selection and Debugging Aids
    • Participants
    • Study Design
    • Evaluation Measures
      • Results
        • Debugging Aids
        • Participant Types
        • Bug Difficulty
        • Programming Experience
        • Regression Analysis
        • Summary
          • Qualitative Analysis
          • Threats to Validity
          • Related Work
          • Conclusion
          • Acknowledgments
          • References
Page 2: Automatically Generated Patches as Debugging Aids: A Human ...people.csail.mit.edu/hunkim/papers/tao-fse2014.pdf · Debugging is of paramount importance in the process of software

Figure 1: Overview of the experimental process. We create debugging tasks using five real bugs. Each bug is aided by one of three hints: its generated patch of low or high quality, or its buggy location. Our participants include computer science graduate students (Grad, 44), software engineers (Engr, 28), and Amazon Mechanical Turk users (MTurk, 23). Graduate students debug using the Eclipse IDE in onsite sessions, while software engineers and MTurk participants debug using our web-based debugging system. We evaluate participants' patch correctness, debugging time, and exit-survey feedback.

We observe a significant increase in debugging correctness when participants were aided by the high-quality generated patches, and such an improvement was more obvious for difficult bugs. Nevertheless, participants were adversely influenced by the low-quality generated patches, as their debugging correctness dropped, surprisingly, to an even lower point than that of the control group. This finding urges strict quality control for generated patches if we were to use them as debugging aids. Otherwise, incorrect or unmaintainable patches can indeed cloud developers' judgement and deteriorate their debugging correctness.

Interestingly, the type of debugging aid has only marginal influence on participants' debugging time, while the types of bugs and participant populations have greater influence. For example, we observe a significant slow-down in debugging time for the most difficult bug, and a significant speed-up for software engineers and MTurk participants compared to graduate students.

We also qualitatively analyze participants' attitudes toward using auto-generated patches. Participants gave generated patches credit for quick problem identification and simplification. However, they remained doubtful about whether generated patches can fix complicated bugs and address root causes. Participants also claimed that they could have been more confident and smooth in using generated patches if they had a deeper understanding of the buggy code, its context, and its test cases.

Overall, this paper makes the following contributions:

• A large-scale human study for assessing the usefulness of auto-generated patches as debugging aids.

• An in-depth quantitative evaluation of how debugging is influenced by the quality of generated patches and other factors, such as the types of bugs and participants.

• A qualitative analysis of patch usages, which sheds light on potential directions of automatic patch generation research.

The remainder of this paper is organized as follows. Section 2 introduces our experimental process. Section 3 and Section 4 report and discuss the study results. Section 5 presents threats to validity, followed by a survey of related work in Section 6. Section 7 concludes the paper.

2. EXPERIMENTAL PROCESS

This section presents the detailed experimental process, following the overview in Figure 1.

2.1 Bug Selection and Debugging Aids

To evaluate the usefulness of auto-generated patches in debugging, we select bugs that satisfy the following criteria:

• To simulate real debugging scenarios, real-world bugs are favored over seeded ones [14], and diversity of bug types is preferred.

• Patches of varied qualities can be automatically generated for the chosen bugs.

• Patches written and accepted by the original developers should be available as the ground truth to evaluate participants' patches.

• A proper number of bugs should be selected to control the length of the human study [16].

Accordingly, we selected five bugs reported by Kim et al. [17] and summarize them in Table 1. Four of the bugs are from Mozilla Rhino (http://www.mozilla.org/rhino) and one is from Apache Commons Math (http://commons.apache.org/proper/commons-math). These bugs manifest different symptoms and cover various defect types, and all of them have been fixed by developer-written and verified patches.

Kim et al. reported that all five bugs can be fixed by two state-of-the-art program repair techniques, GenProg [21] and PAR [17], such that their generated patches can pass all the corresponding test cases [17]. Kim et al. also launched a survey in which 85 developers and CS students were asked to rank the acceptability of the generated patches for each bug [17]. Table 2 reports these ten generated patches (two patches for each bug) along with their average acceptability rankings reported by Kim et al. [17].

While the generated patches are equally qualified in terms of passing test cases, humans rank patch acceptability by further judging their fix semantics and understandability, which are the primary concerns of patch quality [13][15]. Hence, we inherited the acceptability ranking reported by Kim et al. [17] as an indicator of patch quality. Specifically, for the two patches generated for each bug, we labeled the one with a higher ranking as high-quality and the other one with a lower ranking as low-quality.

Since generated patches already reveal buggy locations and suggest candidate fixes, for a fair comparison we provide the names of the buggy methods to the control group instead of leaving it completely unaided. This design conforms to real debugging scenarios, where developers are aware of basic faulty regions, for example from detailed bug reports [14].

Table 1: Description of the five bugs used in the study. We list the semantics of the developers' original fixes, which are used to evaluate the correctness of participants' patches.

Math-280. Buggy code fragment and symptom:
  1  public static double [] bracket (
  2      double lowerBound, double upperBound )
  3    if (fa * fb >= 0.0)
  4      throw new ConvergenceException(
  5        ...
ConvergenceException is thrown at line 4.
Semantics of the original developers' patch:
• If fa*fb == 0.0, then the method bracket() should terminate without ConvergenceException, regardless of the given lowerBound and upperBound values.
• If fa*fb > 0.0, then the method bracket() should throw ConvergenceException.

Rhino-114493. Buggy code fragment and symptom:
  1  if (lhs == undefined) {
  2    lhs = strings[getShort(iCode, pc + 1)];
  3  }
ArrayIndexOutOfBoundsException is thrown at line 2.
Semantics of the original developers' patch:
• If getShort(iCode, pc + 1) returns a valid index, then strings[getShort(iCode, pc + 1)] should be assigned to lhs.
• If getShort(iCode, pc + 1) returns -1, the program should return normally without the AIOBE, and lhs should be undefined.

Rhino-192226. Buggy code fragment and symptom:
  1  private void visitRegularCall(Node node, int
  2      type, Node child, boolean firstArgDone)
  3  {
  4    String simpleCallName = null;
  5    if (type != TokenStream.NEW)
  6      simpleCallName = getSimpleCallName(node);
  7    ...
  8    child = child.getNext().getNext();
NullPointerException is thrown at line 8.
Semantics of the original developers' patch:
• Statements in the block of the if statement are executed only if firstArgDone is false.
• simpleCallName should be initialized with the return value of getSimpleCallName(node).

Rhino-217379. Buggy code fragment and symptom:
  1  for (int i = 0; i < parenCount; i++) {
  2    SubString sub = (SubString) parens.get(i);
  3    args[i+1] = sub.toString();
  4  }
NullPointerException is thrown at line 3.
Semantics of the original developers' patch:
• If parens.get(i) is not null, assign it to args[i+1] for all i = 0..parenCount-1.
• The program should terminate normally without NullPointerException when sub is null.

Rhino-76683. Buggy code fragment and symptom:
  1  for (int i = num; i < state.parenCount; i++)
  2    state.parens[i].length = 0;
  3  state.parenCount = num;
NullPointerException is thrown at line 2.
Semantics of the original developers' patch:
• The program should terminate without NullPointerException even if state.parens[i] is equal to null.
• If the for statement is processing the array state.parens, it should be processed from index num to state.parenCount-1.
For simplicity, we refer to participants given different aids as the LowQ (low-quality patch aided), HighQ (high-quality patch aided), and Location (buggy location aided) groups.

2.2 Participants

Grad. We recruited 44 CS graduate students: 12 from The Hong Kong University of Science and Technology and 32 from Nanjing University. Graduate students typically have some programming experience, but in general they are still in an active learning phase. In this sense, they resemble novice developers.

Engr. We also invited industrial software engineers to represent the population of experienced developers. In this study, we recruited 28 software engineers from 11 companies via email invitations.

MTurk. We further recruited participants from Amazon Mechanical Turk, a crowdsourcing website for requesters to post tasks and for workers to earn money by completing tasks [1]. The MTurk platform potentially widens the variety of our participants, since its workers have diverse demographics such as nationality, age, gender, education, and income [32]. To safeguard worker quality, we prepared buggy code revised from Apache Commons Collections Issue 359 and asked workers to describe how to fix it. Only those who passed this qualifying test could proceed to our debugging tasks. We finally recruited 23 MTurk workers, including 20 developers, two undergraduate students, and one IT manager, with 1-14 (on average 5.7) years of Java programming experience.

2.3 Study Design

We provided detailed bug descriptions from the Mozilla and Apache bug reports to participants as their starting point. We also provided developer-written test cases, which included the failed ones that reproduced the bugs. Although participants were encouraged to fix as many bugs as they could, they were free to skip any bug, as the completion of all five tasks was not mandatory. To conform to participants' different demographics, we adopted the onsite and online settings introduced below.

Onsite setting applies to the graduate students. We adopted a between-group design [20] by dividing them into three groups; each group was exposed to one of the three debugging aids. Prior to the study, we asked students to self-report their Java programming experience and familiarity with Eclipse on a 5-point Likert scale. We used this information to evenly divide them into three groups with members having similar levels of expertise.

Table 2: The two generated patches for each bug. The middle column of the original table shows the patches' acceptability ranking reported by Kim et al. [17]. The patch with a higher ranking (or lower bar) is labeled as high-quality, and its counterpart for the same bug is labeled as low-quality.

Math-280
High-quality patch:
  public static double [] bracket (
      double lowerBound, double upperBound )
    if (fa * fb > 0.0 || fa * fb >= 0.0)
      throw new ConvergenceException(
Low-quality patch:
  public static double [] bracket (
      double lowerBound, double upperBound )
    if (function == null)
      throw MathRuntimeException.createIllegalArgumentException(

Rhino-114493
High-quality patch:
  if (lhs == undefined) {
    if (getShort(iCode, pc + 1) < strings.length &&
        getShort(iCode, pc + 1) >= 0) {
      lhs = strings[getShort(iCode, pc + 1)];
    }
  }
Low-quality patch:
  if (lhs == undefined) {
    lhs = ((Scriptable) lhs).getDefaultValue(null);
  }

Rhino-192226
High-quality patch:
  private void visitRegularCall(Node node, int
      type, Node child, boolean firstArgDone)
  {
    String simpleCallName = null;
    if (type != TokenStream.NEW) {
      if (!firstArgDone)
        simpleCallName = getSimpleCallName(node);
    }
Low-quality patch:
  private void visitRegularCall(Node node, int
      type, Node child, boolean firstArgDone)
  {
    String simpleCallName = null;
    visitStatement(node);

Rhino-217379
High-quality patch:
  for (int i = 0; i < parenCount; i++) {
    SubString sub = (SubString) parens.get(i);
    if (sub != null)
      args[i+1] = sub.toString();
  }
Low-quality patch:
  for (int i = 0; i < parenCount; i++) {
    SubString sub = (SubString) parens.get(i);
    args[parenCount + 1] = new Integer(reImpl.leftContext.length);
  }

Rhino-76683
High-quality patch:
  for (int i = num; i < state.parenCount; i++) {
    if (state != null && state.parens[i] != null)
      state.parens[i].length = 0;
  }
  state.parenCount = num;
Low-quality patch:
  for (int i = num; i < state.parenCount; i++) {
  }
  state.parenCount = num;

Finally, the LowQ and HighQ groups each had 15 participants, and the Location group had 14 participants.

The graduate students used the Eclipse IDE (Figure 2a). We piloted the study with 9 CS undergraduate students and recorded their entire Eclipse usage with screen-capturing software. From this pilot study, we determined that two hours should be adequate for participants to complete all five tasks at a reasonably comfortable pace. The formal study session was supervised by one of the authors and two other helpers. At the beginning of the session, we gave a 10-minute tutorial introducing the tasks. Then, within a maximum of two hours, participants completed the five debugging tasks in their preferred order.

Online setting applies to the software engineers and MTurk participants, who were geographically unavailable. We created a web-based online debugging system (http://pishon.cse.ust.hk/userstudy) that provided features similar to the Eclipse workbench (Figure 2b). Unlike Grad, we could not determine beforehand the total number of these online participants or their expertise. Hence, to minimize individual differences, we adopted a within-group design such that participants could be exposed to different debugging aids [20]. To balance the experimental conditions, we assigned the type of aid to each selected bug in a round-robin fashion, such that each aid was equally likely to be given to each bug.
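One plausible way to implement this round-robin counterbalancing is sketched below in Python; the bug and aid names and the rotation scheme are our own illustration under stated assumptions, not the authors' actual assignment code.

  from itertools import cycle

  BUGS = ["bug1", "bug2", "bug3", "bug4", "bug5"]
  AIDS = ["Location", "LowQ", "HighQ"]

  def assign_aids(participant_index):
      # Rotate the aid sequence by participant so that, across participants,
      # each bug is paired with each aid equally often.
      offset = participant_index % len(AIDS)
      rotation = cycle(AIDS[offset:] + AIDS[:offset])
      return {bug: next(rotation) for bug in BUGS}

  for p in range(3):
      print(p, assign_aids(p))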

We piloted the study with 4 CS graduate students. They were asked to use our debugging system and report any encountered problems. We resolved these problems, such as browser incompatibility, before formally inviting software engineers and MTurk users.

An exit survey was administered to all the participants upon their task completion. In the survey, we asked the participants to rate the difficulty of each bug and the helpfulness of the provided aids on a 5-point Likert scale. We also asked them to share their opinions on using auto-generated patches in free-text form. We asked Engr and MTurk participants to self-report their Java programming experience and debugging time for each task, since the online system cannot monitor periods when they are away from the keyboard (discussed in Section 5). In addition, we asked MTurk participants to report their occupations.

We summarize our experimental settings in Table 3.

Figure 2: The onsite and online debugging environments: (a) the Eclipse workbench and (b) the online debugging system. Both allow participants to browse a bug's description (A), edit the buggy source code (B), view the debugging aids (C), run the test cases (D), and submit their patches (E).

2.4 Evaluation Measures

We measure the participants' patch correctness, debugging time, and survey feedback, as described below.

Correctness indicates whether a patch submitted by a participant is correct or not. We considered a patch correct only if it passed all the given test cases and also matched the semantics of the original developers' fixes as described in Table 1. Two of the authors and one CS graduate student with 8 years of Java programming experience individually evaluated all patches. Initially, the three evaluators agreed on 283 of 337 patches. The disagreement mainly resulted from misunderstanding of patch semantics. After clarification, the three evaluators carried out another round of individual evaluations and this time agreed on 326 out of 337 patches. This was considered near-perfect agreement, as its Fleiss' kappa value equals 0.956 [11]. For the remaining 11 patches, the majority's opinion (i.e., two of the three evaluators) was adopted as the final evaluation result.
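For readers who wish to reproduce this style of agreement analysis, the sketch below shows how Fleiss' kappa can be computed with statsmodels; the rating matrix is a random placeholder, not the study's actual evaluation data.

  import numpy as np
  from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

  # ratings[i, j] = 1 if evaluator j judged patch i correct, 0 otherwise
  # (337 patches x 3 evaluators; placeholder values only)
  ratings = np.random.default_rng(0).integers(0, 2, size=(337, 3))

  table, _ = aggregate_raters(ratings)          # per-patch counts of each category
  print(fleiss_kappa(table, method='fleiss'))   # the paper reports 0.956 on the real data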

Debugging time was recorded automatically. For the onsite session, we created an Eclipse plug-in to sum up the elapsed time of all the activities (e.g., opening a file, modifying the code, and running the tests) related to each bug. For the online sessions, our system computes participants' debugging time as the time elapsed from entering the website until patch submission.

Feedback was collected from the exit survey (Section 2.3). We particularly analyzed participants' perceptions of the helpfulness of generated patches and their free-form answers. These two items allowed us to quantitatively and qualitatively evaluate participants' opinions on using generated patches as debugging aids (Section 3.1 and Section 4).

Table 3: The study settings and designs for the different types of participants, along with their population size and number of submitted patches. In cases where participants skipped or gave up some tasks, we excluded patches with no modification to the original buggy code, and finally collected 337 patches for analysis.

         Setting            Design          Size   Patches
  Grad   Onsite (Eclipse)   Between-group   44     216
  Engr   Online (website)   Within-group    28     68
  MTurk  Online (website)   Within-group    23     53
  Total                                     95     337

3. RESULTS

We report participants' debugging performance with respect to different debugging aids (Section 3.1). We then investigate debugging performance across participant types (Section 3.2), bug difficulties (Section 3.3), and programming experience (Section 3.4). We use regression models to statistically analyze the relations between these factors and debugging performance (Section 3.5), and finally summarize our observations (Section 3.6).

3.1 Debugging Aids

Among the 337 patches we collected, 109, 112, and 116 were aided by buggy locations, LowQ patches, and HighQ patches, respectively. Figure 3 shows an overview of the results in terms of the three evaluation measures.

Correctness (Figure 3a). Patches submitted by the Location group are 48% (52/109) correct. The percentage of correct patches dramatically increases to 71% (82/116) for the HighQ group. LowQ patches do not improve debugging correctness; on the contrary, the LowQ group performs even worse than the Location group, with only 33% (37/112) of patches being correct.

Debugging Time (Figure 3b). The HighQ group has an average debugging time of 14.7 minutes, which is slightly faster than the 17.3 and 16.3 minutes of the Location and LowQ groups.

Feedback (Figure 3c). HighQ patches are considered much more helpful for debugging than LowQ patches. A Mann-Whitney U test suggests that this difference is statistically significant (p-value = 0.0006 < 0.05). To understand the reasons behind this result, we further analyze participants' textual answers and discuss the details in Section 4.
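A minimal sketch of this kind of comparison, assuming two lists of hypothetical 5-point helpfulness ratings rather than the study's actual survey responses:

  from scipy.stats import mannwhitneyu

  lowq_ratings  = [2, 3, 1, 2, 4, 2, 3, 1, 2, 3]   # placeholder Likert ratings
  highq_ratings = [4, 5, 3, 4, 4, 5, 3, 4, 5, 4]

  stat, p = mannwhitneyu(lowq_ratings, highq_ratings, alternative='two-sided')
  print(stat, p)   # the paper reports p = 0.0006 for the real ratings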

3.2 Participant Types

Our three types of participants, Engr, Grad, and MTurk, have different demographics and are exposed to different experimental settings. For this reason, we investigate debugging performance by participant type.

Figure 5a shows a consistent trend across all three participant types: the HighQ group makes the highest percentage of correct patches, followed by the Location group, while the LowQ group makes the lowest. Note that Grad has lower correctness than Engr and MTurk, possibly due to their relatively short programming experience.

Figure 5b reveals that Engr debugs faster than Grad and MTurk participants. Yet, among participants from the same population, debugging time does not vary much between the Location, LowQ, and HighQ groups.

Figure 3: (a) The percentage of correct patches by different debugging aids; (b) debugging time by different aids; (c) participants' perception of the helpfulness of LowQ and HighQ patches, rated from 1 (not helpful) to 5 (very helpful).

3.3 Bug Difficulty

Since the five bugs vary in symptoms and defect types, they possibly pose different levels of challenge to the participants. Research also shows that task difficulty potentially correlates with users' perception of automated diagnostic aids [23]. For this reason, we investigate participants' debugging performance with respect to different bugs. We obtained the difficulty ratings of the bugs from the exit survey. Note that we adopt the difficulty ratings from only the Location group, since other participants' perceptions could have been influenced by the presence of generated patches. As shown in Figure 4, Rhino-192226 is considered the most difficult bug, while Rhino-217379 is considered the easiest. In particular, ANOVA with Tukey's HSD post-hoc test [20] shows that the difficulty rating of Rhino-192226 is significantly higher than that of the remaining four bugs. (For simplicity, in the remainder of the paper we refer to each bug by its order of appearance in Table 1, e.g., bug3 for Rhino-192226.)
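The sketch below illustrates this kind of omnibus ANOVA followed by a Tukey HSD post-hoc comparison; the difficulty ratings in the dataframe are invented placeholders, not the Location group's actual responses.

  import pandas as pd
  from scipy.stats import f_oneway
  from statsmodels.stats.multicomp import pairwise_tukeyhsd

  df = pd.DataFrame({
      "bug":    ["bug1"]*5 + ["bug2"]*5 + ["bug3"]*5 + ["bug4"]*5 + ["bug5"]*5,
      "rating": [3,3,2,4,3, 3,2,3,3,4, 5,4,5,4,5, 2,1,2,2,1, 3,2,3,2,3],
  })

  groups = [g["rating"].values for _, g in df.groupby("bug")]
  print(f_oneway(*groups))                                       # omnibus one-way ANOVA
  print(pairwise_tukeyhsd(df["rating"], df["bug"], alpha=0.05))  # pairwise post-hoc test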

Figure 5c shows that for bug3, the HighQ group achieved a "landslide victory", with 70% of the submitted patches being correct, while no one from the Location or the LowQ group was able to fix it correctly. One primary reason could be the complex nature of this bug, which is caused by a missing if construct (Table 2). This type of bug, also known as a code omission fault, can be hard to identify [10][14]. Fry and Weimer also empirically showed that the accuracy of human fault localization is relatively low for missing-conditional faults [14]. In addition, the fix location of this bug is not the statement that throws the exception, but five lines above it. The code itself is also complex, involving lexical parsing and dynamic compilation. Hence, both the Location and LowQ groups stumbled on this bug. However, the HighQ group indeed benefited from the generated patch, as one participant explained:

"Without deep knowledge of the data structure representing the code being optimized, the problem would have taken quite a long time to determine by manually. The patch precisely points to the problem and provided a good hint, and my debugging time was cut down to about 20 minutes."

For bug4, interestingly, the Location group performs better than the LowQ and even the HighQ group (Figure 5c). One possible explanation is that the code of this bug requires less project-specific knowledge (Table 2).

Figure 4: The average difficulty rating for each bug, from 1 (very easy) to 5 (very difficult).

Moreover, its fix, which adds a null checker, is a common pattern [17]. Therefore, even without the aid of generated patches, participants were likely to resolve this bug by themselves, and they indeed considered it to be the easiest bug (Figure 4). For bug5, the LowQ patch deletes the entire buggy block inside the for-loop (Table 2), which might raise suspicion for being too straightforward. This possibly explains why the LowQ and Location groups have the same correctness for this bug.

In general, the HighQ group has the highest percentage of correct patches for all bugs except bug4. The LowQ group, on the other hand, has comparable or even lower correctness than the Location group. As for debugging time, participants are in general slower for the first three bugs, as shown in Figure 5d. We speculate that this also relates to the difficulty of the bugs, as these three bugs are considered to be more difficult than the other two (Figure 4).

3.4 Programming Experience

Research has found that expert and novice developers differ in debugging strategies and performance [18]. In this study, we also explore whether the usefulness of generated patches as debugging aids is affected by participants' programming experience.

Among all 95 participants, 72 reported their Java programming experience. They have up to 14 and on average 4.4 years of Java programming experience. Accordingly, we divided these 72 participants into two groups: experts with above-average and novices with below-average programming experience.

As shown in Figure 5e, experts are 60% correct when aided by Location. This number increases to 76% when they are aided by HighQ patches and decreases to 45% when they are aided by LowQ patches.

Figure 5: Percentage of correct patches and debugging time, broken down by participant type (a, b), by bug (c, d), and by programming experience (e, f). Each panel compares the Location, LowQ, and HighQ groups.

Novices manifest a similar yet sharper trend: their percentage of correct patches increases substantially from 44% to 74% when they are aided by Location and HighQ patches, respectively. Interestingly, when aided by HighQ patches, the debugging correctness of novices is almost as good as that of the experts. Figure 5f shows that novices also spend less time debugging with HighQ patches than with Location and LowQ patches.

3.5 Regression Analysis

The preceding subsections observe that changes in debugging correctness and time are attributable to debugging aids, bugs, participant types, programming experience, and their joint effects. To quantify the effects of these factors on debugging performance, we perform multiple regression analysis. Multiple regression analysis is a statistical process for quantifying relations between a dependent variable and a number of independent variables [12], which fits our problem naturally. It also quantifies the statistical significance of the estimated relations [12]. This type of analysis has been applied in other similar research, for example to analyze how configurational characteristics of software teams affect project productivity [4][5][31].

Debugging aids, participant types, and bugs are all categorical variables, with 3, 3, and 5 different values, respectively. Since categorical variables cannot be directly used in regression models, we apply dummy coding to transform a categorical variable with k values into k-1 binary variables [12]. For example, for the debugging aids we create two binary variables, HighQ and LowQ, one of them being 1 if the corresponding patch is given. When both variables are 0, the Location aid is given, and it serves as the basis to which the other two categories are compared. In the same manner, we dummy-code participant types and bugs, with "Grad" and "bug4" as the basis categories (the selection of the basis category is essentially arbitrary and does not affect the regression results [12]). The equations below form our regression models. Since patch correctness is binary data while debugging time is continuous data, we estimate Equation 1 using logistic regression and Equation 2 using linear regression [12]. Note that we log-transform debugging time to ensure its normality [12].

logit(correctness) = \alpha_0 + \alpha_1 \cdot HighQ + \alpha_2 \cdot LowQ + \alpha_3 \cdot Engr + \alpha_4 \cdot MTurk + \alpha_5 \cdot Bug1 + \alpha_6 \cdot Bug2 + \alpha_7 \cdot Bug3 + \alpha_8 \cdot Bug5 + \alpha_9 \cdot JavaYears    (1)

\ln(time) = \beta_0 + \beta_1 \cdot HighQ + \beta_2 \cdot LowQ + \beta_3 \cdot Engr + \beta_4 \cdot MTurk + \beta_5 \cdot Bug1 + \beta_6 \cdot Bug2 + \beta_7 \cdot Bug3 + \beta_8 \cdot Bug5 + \beta_9 \cdot JavaYears    (2)
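A minimal sketch of how such dummy-coded models can be estimated with statsmodels' formula API; the file name and column names (aid, participant, bug, java_years, correct, time_minutes) are assumptions for illustration, not the authors' actual dataset. The Treatment(...) contrasts select the Location, Grad, and bug4 basis categories described above.

  import numpy as np
  import pandas as pd
  import statsmodels.formula.api as smf

  df = pd.read_csv("debugging_results.csv")     # hypothetical file, one row per submitted patch
  df["log_time"] = np.log(df["time_minutes"])   # log-transform debugging time

  rhs = ("C(aid, Treatment('Location')) + C(participant, Treatment('Grad')) "
         "+ C(bug, Treatment('bug4')) + java_years")

  logit_fit = smf.logit("correct ~ " + rhs, data=df).fit()   # Equation (1)
  ols_fit   = smf.ols("log_time ~ " + rhs, data=df).fit()    # Equation (2)

  print(logit_fit.summary())
  print(np.exp(logit_fit.params))   # "anti-logged" coefficients, i.e., odds ratios (e.g., e^1.25 = 3.5)
  print(ols_fit.summary())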

Table 4 shows the regression results, with the first number being the variable coefficient and the second number, in parentheses, its p-value. In general, a positive coefficient means the variable (row name) improves the debugging performance (column name) compared to the basis category, while a negative coefficient means the opposite. Since the logistic regression computes the log of the odds for correctness, and we log-transform debugging time in the linear regression, both computed coefficients need to be "anti-logged" to interpret the results. We highlight our findings below.

HighQ patches significantly improve debugging correctness, with coef = 1.25 > 0 and p-value = 0.00 < 0.05. To put this into perspective: holding other variables fixed, the odds of the HighQ group making correct patches are e^1.25 = 3.5 times those of the Location group. The LowQ group, on the other hand, is less likely to make correct patches, though this trend is not significant (coef = -0.55, p = 0.09). The aid of HighQ (coef = -0.19, p = 0.08) and LowQ patches (coef = -0.07, p = 0.56) both slightly reduce the debugging time, but not significantly.

Difficult bugs significantly slow down debugging, as bug1, bug2, and bug3 all have significant positive coefficients, i.e., longer debugging time.

Table 4: Regression results. Arrows indicate the direction of impact on debugging performance. Double arrows indicate that the coefficient is statistically significant at the 5% level.

              Correctness          Debugging Time
              coef (p-value)       coef (p-value)
  HighQ        1.25 (0.00) ⇑       -0.19 (0.08) ↓
  LowQ        -0.55 (0.09) ↓       -0.07 (0.56) ↓
  Bug1        -0.57 (0.18) ↓        0.96 (0.00) ⇑
  Bug2        -0.50 (0.22) ↓        0.56 (0.00) ⇑
  Bug3        -2.09 (0.00) ⇓        0.59 (0.00) ⇑
  Bug5        -0.59 (0.16) ↓       -0.05 (0.75) ↓
  Engr         1.03 (0.02) ⇑       -1.25 (0.00) ⇓
  MTurk        0.90 (0.02) ⇑       -0.16 (0.20) ↓
  Java years   0.12 (0.02) ⇑       -0.03 (0.11) ↓

For the most difficult bug, bug3, the debugging time is e^0.59 = 1.8 times that of bug4, yet the odds of correctly fixing it are only e^-2.09 ≈ 1/8 of the odds of correctly fixing bug4.

Participant type and experience significantly affect debugging performance. Compared to Grad, Engr and MTurk are much more likely to make correct patches (coef = 1.03 and 0.90, with p = 0.02). Engr is also significantly faster, spending approximately e^-1.25 ≈ 1/3 of the time debugging. In addition, more Java programming experience significantly improves debugging correctness (coef = 0.12, p = 0.02).

Our regression analysis thus far investigates the individual impact of debugging aids and other variables on debugging performance. However, the impact of debugging aids also possibly depends on the participants who use the aids or the bugs to be fixed. To explore such joint effects, we add interaction variables to the regression model and use ANOVA to analyze their explanatory power [12]. Below are our findings.

HighQ patches are more useful for difficult bugs: as shown in Table 5, in addition to the four single variables, the debugging aids and bugs also jointly explain a large variance in patch correctness (deviance reduction = 25.54, p = 0.00). Hence, we add this interaction variable, Aid*Bug, to the logistic regression for estimating correctness. Table 6 shows that HighQ*Bug3 has a relatively large positive coefficient (coef = 18.69), indicating that the HighQ patch is particularly useful for fixing this bug.

The type of aid does not affect debugging time: none of the interaction variables significantly contributes to the model (Table 5). Interestingly, neither does the debugging aid itself, which explains only 1% (R^2 = 0.01) of the variance in debugging time. Instead, debugging time is much more susceptible to participant types and bugs, which further explain 20% and 16% of its variance, respectively.
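A sketch of how such interaction effects can be checked, continuing the hypothetical dataframe from the previous sketch; this illustrates the nested-model comparison idea, not the authors' exact analysis.

  import statsmodels.formula.api as smf
  from statsmodels.stats.anova import anova_lm

  # Linear model for debugging time: does adding the Aid*Bug interaction explain extra variance?
  base  = smf.ols("log_time ~ C(aid) + C(participant) + C(bug) + java_years", data=df).fit()
  inter = smf.ols("log_time ~ C(aid) * C(bug) + C(participant) + java_years", data=df).fit()
  print(anova_lm(base, inter))

  # Logistic model for correctness: compare residual deviances of the nested models.
  logit_base  = smf.logit("correct ~ C(aid) + C(participant) + C(bug) + java_years", data=df).fit()
  logit_inter = smf.logit("correct ~ C(aid) * C(bug) + C(participant) + java_years", data=df).fit()
  print(2 * (logit_inter.llf - logit_base.llf))   # deviance reduction from adding Aid*Bug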

These regression results are consistent with those observed from Figures 3, 4, and 5 in the preceding subsections. Due to space limitations, we elaborate more details of the regression analysis at http://www.cse.ust.hk/~idagoo/autofix/regression_analysis.html, which reports the descriptive statistics and correlation matrix for the independent variables, the hierarchical regression results, and how Tables 4, 5, and 6 are derived.

Table 5: ANOVA for the regression models with interaction variables. It shows how much variance in correctness/debugging time is explained as each (interaction) variable is added to the model. The p-value in parentheses indicates whether adding the variable significantly improves the model.

                     Correctness                    Debugging Time
  Variable           (resid. deviance reduction)    (R^2, variance explained)
  DebuggingAid       32.80 (0.00)                   0.01 (0.06)
  Participant        16.85 (0.00)                   0.20 (0.00)
  Bug                21.91 (0.00)                   0.16 (0.00)
  JavaYears           6.13 (0.01)                   0.01 (0.11)
  Aid*Participant     2.95 (0.57)                   0.02 (0.06)
  Aid*Bug            25.54 (0.00)                   0.02 (0.26)
  Aid*JavaYears       1.18 (0.55)                   0.00 (0.82)

Table 6: Coefficients of the interaction variable DebugAid*Bug for estimating patch correctness. The bold value is statistically significant at the 5% level.

           Bug1     Bug2     Bug3     Bug4     Bug5
  HighQ    0.42     1.86     18.69    -0.45    0.99
  LowQ    -0.39    -0.24     0.12     -1.76   -0.09

3.6 Summary

Based on our results, we draw the following conclusions:

• High-quality generated patches significantly improve debugging correctness and are particularly beneficial for difficult bugs. Low-quality generated patches slightly undermine debugging correctness.

• Participants' debugging time is not affected by the debugging aids they use. However, their debugging takes significantly longer for difficult bugs.

• Participant type and experience also significantly affect debugging performance. Engr, MTurk, and participants with more Java programming experience are more likely to make correct patches.

4. QUALITATIVE ANALYSIS

We used the survey feedback to qualitatively investigate participants' opinions on using auto-generated patches in debugging. In total, we received 71 textual answers. After an open coding phase [20], we identified twelve common reasons why participants are positive or negative about using auto-generated patches as debugging aids. Table 7 summarizes these reasons along with the participants' original answers.

Participants from both the HighQ and LowQ groups acknowledge that generated patches provide quick starting points by pinpointing the general buggy area (P1-P3). HighQ patches in particular simplify debugging and speed up the debugging process (P4-P6). However, a quick starting point does not guarantee that the debugging is going down the right path, since generated patches can be confusing and misleading (N1-N2). Even HighQ patches may not completely solve the problem and thus need further human perfection (N3).

Participants, regardless of whether they used HighQ or LowQ patches, have a general concern that generated patches may over-complicate debugging or over-simplify it by not addressing root causes (N4-N5). In other words, participants are concerned about whether machines are making random or educated guesses when generating patches (N6). Another shared concern is that the usage of generated patches should be based on a good understanding of the target programs (N7-N8).

Table 7: Participants' positive and negative opinions on using auto-generated patches as debugging aids. All of these sentences are the participants' original feedback. We organize them into categories, each introduced by a summary (shown in italics in the original table).

Positive: It provides a quick starting point.
  P1: "It usually provides a good starting point." (HighQ)
  P2: "It did point to the general area of code." (LowQ)
  P3: "I have followed the patch to identify the exact place where the issue occurred." (LowQ)
Negative: It can be confusing, misleading, or incomplete.
  N1: "The generated patch was confusing." (LowQ)
  N2: "For developers with less experience, it can give them the wrong lead." (LowQ)
  N3: "It's not a complete solution. The patch tested for two null references; however, one of the two variables was dereferenced about 1-2 lines above the faulting line. Therefore, there should have been two null reference tests." (HighQ)

Positive: It simplifies the problem.
  P4: "I feel that the hints were quite good and simplified the problem." (HighQ)
  P5: "The hint provided the solution to the problem that would have taken quite a long time to determine by manually." (HighQ)
  P6: "This saves time when it comes to bugs and memory monitoring." (HighQ)
Negative: It may over-complicate the problem or over-simplify it by not addressing the root cause.
  N4: "Sometimes they might over-complicate the problem or not address the root of the issue." (HighQ)
  N5: "The patch did not appear to catch the issue and instead looked like it deleted the block of code." (LowQ)
  N6: "Generated patches are only useful when the machine does not do it randomly, e.g., it cannot just guess that something may fix it, and if it does that, that is good." (LowQ)

Positive: It helps brainstorming.
  P7: "They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn't always correct on its own." (LowQ)
Negative: It may not be helpful for unfamiliar code.
  N7: "I prefer to fix the bug manually with a deeper understanding of the program." (HighQ)
  N8: "A context to the problem may have helped here." (LowQ)

Positive: It recognizes easy problems.
  P8: "They would be good at recognizing obvious problems (e.g., potential NPE)." (HighQ)
Negative: It may not recognize complicated problems.
  N9: "They would be difficult to recognize more involved defects." (HighQ)

Positive: It makes tests pass.
  P9: "The patch made the test case pass." (LowQ)
Negative: It may not work if the test suite itself is insufficient.
  N10: "Assuming the test cases are sufficient, the patch will work." (HighQ)
  N11: "I'm worried that other test cases will fail." (LowQ)

Positive: It provides extra diagnostic information.
  P10: "Any additional information regarding a bug is almost always helpful." (HighQ)
Negative: It cannot replace standard debugging tools.
  N12: "I would use them as debugging aids along with access to standard debugging tools." (LowQ)

Interestingly, we also observe several opposite opinions. Participants think that generated patches may be good at recognizing easy bugs (P8) but may not work well for complicated ones (N9). While generated patches can quickly pass test cases (P9), they work only if the test cases are sufficient (N10-N11). Finally, participants use generated patches to brainstorm (P7) and to acquire additional diagnostic information (P10). However, they still consider generated patches dispensable, especially when powerful debugging tools are available (N12).

In general, participants are positive about auto-generated patches, as they provide quick hints about buggy areas or even fix solutions. On the other hand, participants are less confident about the capability of generated patches to handle complicated bugs and address root causes, especially when they are unfamiliar with the code and its test cases.

5. THREATS TO VALIDITY

The five bugs and ten generated patches we used to construct the debugging tasks may not be representative of all bugs and generated patches. We mitigate this threat by selecting real bugs with varied symptoms and defect types, and by using patches generated by two state-of-the-art program repair techniques [17][34]. Nonetheless, further studies with wider coverage of bugs and auto-generated patches are required to strengthen the generalizability of our findings.

We measured patch quality using human-perceived acceptability, which may not generalize to other measurements, such as metric-based ones [27]. However, patch quality is a rather broad concept, and research has suggested that there is no easy way to definitively describe it [13][24]. Another threat regarding patch quality is that we labeled a generated patch as "LowQ" relative to its competing "HighQ" patch in this study. We plan to work on a more general solution by establishing a quality baseline, for example the buggy location as in the control group, and distinguishing the quality of auto-generated patches according to this baseline.

To avoid repeated testing [20], we did not invite domain experts who were familiar with the buggy code, since it would be hard to judge whether their good performance is attributable to the debugging aids or to their a priori knowledge of the corresponding buggy code. Instead, we recruited participants who had little or no domain knowledge of the code. This is not an unrealistic setting, since in practice developers are often required to work with unfamiliar code due to limited human resources or deadline pressure [9][19][25]. However, our findings may not generalize to project developers who are sufficiently knowledgeable about the code. Future studies are required to explore how developers make use of auto-generated patches to debug code they are familiar with.

Our manual evaluation of the participants' patches might be biased. We mitigate this threat by considering patches verified by the original project developers as the ground truth. Three human evaluators individually evaluated the patches over two iterations and reached a consensus, with Fleiss' kappa equal to 0.956 (Section 2.4).

Our measurement of debugging time might be inaccurate for Engr and MTurk participants, since the online system cannot remotely monitor participants when they are away from the keyboard during debugging. To mitigate this threat, we asked these participants to self-report the time they had spent on each task and found no significant difference between their estimated time and our recorded time.

One may argue that participants might have blindly reused the auto-generated patches instead of truly fixing the bugs on their own. We took several preventive measures to discourage such behavior. First, we emphasized in the instructions that the provided patches may or may not really fix the bugs, and that participants should make their own judgement. Second, we claimed to keep additional test cases to ourselves, so that participants could not relax even if they passed all the tests by reusing the generated patches. We also required participants to justify their patches during submission. As discussed in Section 3, the participants aided by generated patches spent a similar amount of debugging time to the Location group. This indicates that participants were unlikely to reuse generated patches directly, which would otherwise take only seconds to complete.

6. RELATED WORK

Automatic patch generation techniques have been actively studied. Most studies evaluated the quantitative aspects of the proposed techniques, such as the number of candidates generated before a valid patch is found and the number of subjects for which patches can be generated [8][21][26][30][33][34]. Our study complements prior studies by qualitatively evaluating the usefulness of auto-generated patches as debugging aids.

Several studies have discussed the quality of generated patches. Kim et al. evaluated generated patches using human acceptability [17], which is adopted in our study to indicate patch quality. Fry et al. considered patch quality in terms of understandability and maintainability [13]. Their human study revealed that generated patches alone are less maintainable than human-written patches. However, when augmented with synthesized documentation, generated patches have comparable or even better maintainability than human-written patches [13]. Fry et al. also investigated code features (e.g., the number of conditionals) that correlate with patch maintainability [13].

While patch quality can be measured from various perspectives, research has suggested that, instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help with difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, while generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behaviors. In fact, contrary to the findings of Ceccato et al., we observed that participants' debugging performance dropped when they were given low-quality generated patches.

7. CONCLUSION

We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discover that participants do fix bugs more quickly and correctly using high-quality generated patches. However, when aided by low-quality generated patches, participants' debugging correctness is compromised.

Based on the study results, we have learned important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required even if they are used as debugging aids rather than directly integrated into production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs; for easy bugs, however, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from using generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable about the buggy code.

8. ACKNOWLEDGMENTS

We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9. REFERENCES

[1] Amazon Mechanical Turk. https://www.mturk.com/mturk.

[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.

[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE '09.

[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE '09.

[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE '11, 2011.

[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezze. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.

[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE '12, pages 452-462.

[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE '09.

[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: an exploratory study. ICSE '12, pages 266-276.

[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. IEEE Transactions on Software Engineering, 32(11):849-867, 2006.

[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.

[12] J. Fox. Applied regression analysis, linear models, and related methods. Sage, 1997.

[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA '12.

[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM '10.

[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE '10.

[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.

[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.

[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC '03.

[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971-987, 2006.

[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.

[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. ICSE '12.

[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1-23, 2013.

[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241-256, 2006.

[24] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE '14.

[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: a controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 12-22, 2006.

[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.

[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks - an empirical study with industry practitioners. In ICSM '11.

[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11.

[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP '09.

[30] Y. Qi, X. Mao, Y. Lei, and C. Wang. Using automated program repair for evaluating the effectiveness of fault localization techniques. ISSTA '13, 2013.

[31] N. Ramasubbu, M. Cataldo, R. K. Balan, and J. D. Herbsleb. Configuring global software teams: A multi-company analysis of project productivity, quality, and profits. ICSE '11, pages 261-270, 2011.

[32] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. CHI EA '10.

[33] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller. Automated fixing of programs with contracts. ISSTA '10.

[34] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE '09.

  • Introduction
  • Experimental Process
    • Bug Selection and Debugging Aids
    • Participants
    • Study Design
    • Evaluation Measures
      • Results
        • Debugging Aids
        • Participant Types
        • Bug Difficulty
        • Programming Experience
        • Regression Analysis
        • Summary
          • Qualitative Analysis
          • Threats to Validity
          • Related Work
          • Conclusion
          • Acknowledgments
          • References
Page 3: Automatically Generated Patches as Debugging Aids: A Human ...people.csail.mit.edu/hunkim/papers/tao-fse2014.pdf · Debugging is of paramount importance in the process of software

Table 1: Description of the five bugs used in the study. We list the semantics of the developers' original fixes, which are used to evaluate the correctness of participants' patches.

Math-280. Buggy code fragment and symptom:
    1 public static double[] bracket(
    2     double lowerBound, double upperBound) {
    3   if (fa * fb >= 0.0) {
    4     throw new ConvergenceException(...
ConvergenceException is thrown at line 4.
Semantics of the original developers' patch:
• If fa*fb == 0.0, then the method bracket() should terminate without ConvergenceException, regardless of the given lowerBound and upperBound values.
• If fa*fb > 0.0, then the method bracket() should throw ConvergenceException.

Rhino-114493. Buggy code fragment and symptom:
    1 if (lhs == undefined) {
    2   lhs = strings[getShort(iCode, pc + 1)];
    3 }
ArrayIndexOutOfBoundsException is thrown at line 2.
Semantics of the original developers' patch:
• If getShort(iCode, pc + 1) returns a valid index, then strings[getShort(iCode, pc + 1)] should be assigned to lhs.
• If getShort(iCode, pc + 1) returns -1, the program should return normally without the ArrayIndexOutOfBoundsException, and lhs should be undefined.

Rhino-192226. Buggy code fragment and symptom:
    1 private void visitRegularCall(Node node, int
    2     type, Node child, boolean firstArgDone) {
    3   ...
    4   String simpleCallName = null;
    5   if (type != TokenStream.NEW)
    6     simpleCallName = getSimpleCallName(node);
    7   ...
    8   child = child.getNext().getNext();
NullPointerException is thrown at line 8.
Semantics of the original developers' patch:
• Statements in the block of the if statement are executed only if firstArgDone is false.
• simpleCallName should be initialized with a return value of getSimpleCallName(node).

Rhino-217379. Buggy code fragment and symptom:
    1 for (int i = 0; i < parenCount; i++) {
    2   SubString sub = (SubString) parens.get(i);
    3   args[i+1] = sub.toString();
    4 }
NullPointerException is thrown at line 3.
Semantics of the original developers' patch:
• If parens.get(i) is not null, assign it to args[i+1] for all i = 0..parenCount-1.
• The program should terminate normally without NullPointerException when sub is null.

Rhino-76683. Buggy code fragment and symptom:
    1 for (int i = num; i < state.parenCount; i++)
    2   state.parens[i].length = 0;
    3 state.parenCount = num;
NullPointerException is thrown at line 2.
Semantics of the original developers' patch:
• The program should terminate without NullPointerException even if state.parens[i] is equal to null.
• If the for statement processes the array state.parens, it should be processed from index num to state.parenCount-1.

For simplicity, we refer to participants given different aids as the LowQ (low-quality patch aided), HighQ (high-quality patch aided), and Location (buggy location aided) groups.

2.2 Participants

Grad. We recruited 44 CS graduate students: 12 from The Hong Kong University of Science and Technology and 32 from Nanjing University. Graduate students typically have some programming experience, but in general they are still in an active learning phase. In this sense, they resemble novice developers.

Engr. We also invited industrial software engineers to represent the population of experienced developers. In this study, we recruited 28 software engineers from 11 companies via email invitations.

MTurk. We further recruited participants from Amazon Mechanical Turk, a crowdsourcing website where requesters post tasks and workers earn money by completing them [1]. The MTurk platform potentially widens the variety of our participants, since its workers have diverse demographics, such as nationality, age, gender, education, and income [32]. To safeguard worker quality, we prepared a buggy code sample revised from Apache Commons Collections Issue 359 and asked workers to describe how to fix it. Only those who passed this qualifying test could proceed to our debugging tasks. We finally recruited 23 MTurk workers, including 20 developers, two undergraduate students, and one IT manager, with 1-14 (on average 5.7) years of Java programming experience.

2.3 Study Design

We provided detailed bug descriptions from the Mozilla and Apache bug reports to participants as their starting point. We also provided developer-written test cases, which included the failing ones that reproduced the bugs. Although participants were encouraged to fix as many bugs as they could, they were free to skip any bug, as the completion of all five tasks was not mandatory. To accommodate participants' different demographics, we adopted the onsite and online settings introduced below.

The onsite setting applies to the graduate students. We adopted a between-group design [20] by dividing them into three groups; each group was exposed to one of the three debugging aids. Prior to the study, we asked students to self-report their Java programming experience and familiarity with Eclipse on a 5-point Likert scale. We used this information to evenly divide them into three groups with members having similar levels of expertise.

Table 2: The two generated patches for each bug. The original table also shows each patch's acceptability ranking (a 0-3 bar) reported by Kim et al. [17]; the patch with the higher ranking is labeled as high-quality and its counterpart for the same bug is labeled as low-quality. (The ranking bars are not reproduced here.)

Math-280
High-quality patch:
    public static double[] bracket(
        double lowerBound, double upperBound) {
      if (fa * fb > 0.0 || fa * fb >= 0.0) {
        throw new ConvergenceException(...
Low-quality patch:
    public static double[] bracket(
        double lowerBound, double upperBound) {
      if (function == null) {
        throw MathRuntimeException.createIllegalArgumentException(...

Rhino-114493
High-quality patch:
    if (lhs == undefined) {
      if (getShort(iCode, pc + 1) < strings.length &&
          getShort(iCode, pc + 1) >= 0)
        lhs = strings[getShort(iCode, pc + 1)];
    }
Low-quality patch:
    if (lhs == undefined)
      lhs = ((Scriptable) lhs).getDefaultValue(null);

Rhino-192226
High-quality patch:
    private void visitRegularCall(Node node, int
        type, Node child, boolean firstArgDone) {
      ...
      String simpleCallName = null;
      if (type != TokenStream.NEW)
        if (!firstArgDone)
          simpleCallName = getSimpleCallName(node);
Low-quality patch:
    private void visitRegularCall(Node node, int
        type, Node child, boolean firstArgDone) {
      ...
      String simpleCallName = null;
      visitStatement(node);

Rhino-217379
High-quality patch:
    for (int i = 0; i < parenCount; i++) {
      SubString sub = (SubString) parens.get(i);
      if (sub != null)
        args[i+1] = sub.toString();
    }
Low-quality patch:
    for (int i = 0; i < parenCount; i++) {
      SubString sub = (SubString) parens.get(i);
      args[parenCount + 1] = new Integer(reImpl.leftContext.length);
    }

Rhino-76683
High-quality patch:
    for (int i = num; i < state.parenCount; i++)
      if (state != null && state.parens[i] != null)
        state.parens[i].length = 0;
    state.parenCount = num;
Low-quality patch:
    for (int i = num; i < state.parenCount; i++) {
      // (buggy block deleted)
    }
    state.parenCount = num;

Finally, the LowQ and HighQ groups each had 15 participants, and the Location group had 14 participants.

The graduate students used the Eclipse IDE (Figure 2a). We piloted the study with 9 CS undergraduate students and recorded their entire Eclipse usage with screen-capturing software. From this pilot study, we determined that two hours should be adequate for participants to complete all five tasks at a reasonably comfortable pace. The formal study session was supervised by one of the authors and two other helpers. At the beginning of the session, we gave a 10-minute tutorial introducing the tasks. Then, within a maximum of two hours, participants completed the five debugging tasks in their preferred order.

The online setting applies to the software engineers and MTurk participants, who were geographically unavailable. We created a web-based online debugging system (footnote 3) that provided similar features to the Eclipse workbench (Figure 2b). Unlike with Grad, we could not determine beforehand the total number of these online participants or their expertise. Hence, to minimize individual differences, we adopted a within-group design, such that each participant can be exposed to different debugging aids [20]. To balance the experimental conditions, we assigned the type of aid to each selected bug in a round-robin fashion, such that each aid was equally likely to be given with each bug.

Footnote 3: http://pishon.cse.ust.hk/userstudy
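To make the counterbalancing concrete, the sketch below shows one way such a round-robin assignment could be implemented. It is an illustration only: the class and method names are hypothetical, and the paper does not publish the online system's actual assignment code.

    import java.util.*;

    // Hypothetical sketch: each online participant sees all five bugs, and the
    // aid type rotates with the participant index so that, across participants,
    // every aid is paired with every bug equally often.
    enum Aid { LOCATION, LOW_Q, HIGH_Q }

    class AidAssignment {
        static Map<String, Aid> assign(List<String> bugs, int participantIndex) {
            Map<String, Aid> plan = new LinkedHashMap<>();
            Aid[] aids = Aid.values();
            for (int i = 0; i < bugs.size(); i++) {
                // Shift the starting aid by the participant index so the pairing rotates.
                plan.put(bugs.get(i), aids[(participantIndex + i) % aids.length]);
            }
            return plan;
        }

        public static void main(String[] args) {
            List<String> bugs = List.of("Math-280", "Rhino-114493", "Rhino-192226",
                                        "Rhino-217379", "Rhino-76683");
            System.out.println(assign(bugs, 0));   // participant 0's bug-to-aid plan
            System.out.println(assign(bugs, 1));   // participant 1's plan, rotated by one
        }
    }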

We piloted the online study with 4 CS graduate students. They were asked to use our debugging system and report any problems they encountered. We resolved these problems, such as browser incompatibility, before formally inviting software engineers and MTurk users.

An exit survey was administered to all the participants upon their task completion. In the survey, we asked the participants to rate the difficulty of each bug and the helpfulness of the provided aids on a 5-point Likert scale. We also asked them to share their opinions on using auto-generated patches in free-text form. We asked Engr and MTurk participants to self-report their Java programming experience and debugging time for each task, since the online system cannot monitor when they were away from the keyboard (discussed in Section 5). In addition, we asked MTurk participants to report their occupations.

We summarize our experimental settings in Table 3.

Figure 2: The onsite and online debugging environments: (a) the Eclipse workbench and (b) the online debugging system. Both allow participants to browse a bug's description (A), edit the buggy source code (B), view the debugging aids (C), run the test cases (D), and submit their patches (E).

2.4 Evaluation Measures

We measure the participants' patch correctness, debugging time, and survey feedback, as described below.

Correctness indicates whether a patch submitted by a participant is correct or not. We considered a patch correct only if it passed all the given test cases and also matched the semantics of the original developers' fixes, as described in Table 1. Two of the authors and one CS graduate student with 8 years of Java programming experience individually evaluated all patches. Initially, the three evaluators agreed on 283 of the 337 patches. The disagreement mainly resulted from misunderstandings of patch semantics. After clarification, the three evaluators carried out another round of individual evaluations and this time agreed on 326 out of 337 patches. This was considered near-perfect agreement, as its Fleiss' kappa value equals 0.956 [11]. For the remaining 11 patches, the majority's opinion (i.e., that of two of the three evaluators) was adopted as the final evaluation result.
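As a reminder of the agreement metric (a standard definition, not reproduced from the paper), Fleiss' kappa for N rated items is

\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i,
\qquad
\bar{P}_e = \sum_{j} p_j^{\,2},
\]

where \(P_i\) is the proportion of agreeing rater pairs on item \(i\) and \(p_j\) is the overall proportion of ratings assigned to category \(j\). In this study, N = 337 patches, there are three raters, and the categories are simply "correct" and "incorrect".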

Debugging time was recorded automatically. For the onsite session, we created an Eclipse plug-in to sum up the elapsed time of all the activities (e.g., opening a file, modifying the code, and running the tests) related to each bug. For online sessions, our system computes a participant's debugging time as the time elapsed from entering the website until patch submission.
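The per-bug accumulation idea behind the onsite measurement can be sketched as follows. This is a hypothetical illustration under assumed names (the authors' actual plug-in and its event hooks are not published); it only shows how activity events attributed to a bug could be summed into that bug's elapsed time.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: accumulate elapsed time per bug from IDE activity
    // events (file opened, code modified, tests run).
    class DebuggingTimer {
        private final Map<String, Long> totalMillis = new HashMap<>();
        private String currentBug;
        private long lastEventAt;

        // Called whenever an activity event is observed for some bug.
        void onActivity(String bugId, long timestampMillis) {
            if (bugId.equals(currentBug)) {
                // Still working on the same bug: add the interval since the last event.
                totalMillis.merge(bugId, timestampMillis - lastEventAt, Long::sum);
            } else {
                currentBug = bugId;   // switched tasks; start a new interval
            }
            lastEventAt = timestampMillis;
        }

        double minutesSpentOn(String bugId) {
            return totalMillis.getOrDefault(bugId, 0L) / 60000.0;
        }

        public static void main(String[] args) {
            DebuggingTimer t = new DebuggingTimer();
            t.onActivity("Math-280", 0);
            t.onActivity("Math-280", 90_000);   // 1.5 minutes of activity on Math-280
            System.out.println(t.minutesSpentOn("Math-280"));
        }
    }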

Feedback was collected from the exit survey (Section 2.3). We particularly analyzed participants' perception of the helpfulness of generated patches and their free-form answers. These two items allowed us to quantitatively and qualitatively evaluate participants' opinions on using generated patches as debugging aids (Section 3.1 and Section 4).

Table 3: The study settings and designs for different types of participants, along with their population size and number of submitted patches. In cases where participants skipped or gave up some tasks, we excluded patches with no modification to the original buggy code, and finally collected 337 patches for analysis.

          Setting            Design          Size   Patches
Grad      Onsite (Eclipse)   Between-group   44     216
Engr      Online (website)   Within-group    28     68
MTurk     Online (website)   Within-group    23     53
Total                                        95     337

3. RESULTS

We report participants' debugging performance with respect to different debugging aids (Section 3.1). We then investigate debugging performance across participant types (Section 3.2), bug difficulties (Section 3.3), and programming experience (Section 3.4). We use regression models to statistically analyze the relations between these factors and debugging performance (Section 3.5), and finally summarize our observations (Section 3.6).

3.1 Debugging Aids

Among the 337 patches we collected, 109, 112, and 116 were aided by buggy locations, LowQ patches, and HighQ patches, respectively. Figure 3 shows an overview of the results in terms of the three evaluation measures.

Correctness (Figure 3a). Patches submitted by the Location group are 48% (52/109) correct. The percentage of correct patches dramatically increases to 71% (82/116) for the HighQ group. LowQ patches do not improve debugging correctness; on the contrary, the LowQ group performs even worse than the Location group, with only 33% (37/112) of its patches being correct.

Debugging Time (Figure 3b). The HighQ group has an average debugging time of 14.7 minutes, which is slightly faster than the 17.3 and 16.3 minutes of the Location and LowQ groups.

Feedback (Figure 3c). HighQ patches are considered much more helpful for debugging than LowQ patches. A Mann-Whitney U test suggests that this difference is statistically significant (p-value = 0.0006 < 0.05). To understand the reasons behind this result, we further analyze participants' textual answers and discuss the details in Section 4.

3.2 Participant Types

Our three types of participants, Engr, Grad, and MTurk, have different demographics and are exposed to different experimental settings. For this reason, we investigate debugging performance by participant type.

Figure 5a shows a consistent trend across all three participant types: the HighQ group makes the highest percentage of correct patches, followed by the Location group, while the LowQ group makes the lowest. Note that Grad has lower correctness compared to Engr and MTurk, possibly due to their relatively short programming experience.

Figure 5b reveals that Engr debugs faster than Grad and MTurk participants. Yet among participants from the same population, debugging time does not vary much between the Location, LowQ, and HighQ groups.

Figure 3: (a) The percentage of correct patches by different debugging aids; (b) debugging time by different aids; (c) participants' perception of the helpfulness of LowQ and HighQ patches, rated from 1 (not helpful) to 5 (very helpful).

3.3 Bug Difficulty

Since the five bugs vary in symptoms and defect types, they possibly pose different levels of challenge to the participants. Research also shows that task difficulty potentially correlates with users' perception of automated diagnostic aids [23]. For this reason, we investigate participants' debugging performance with respect to different bugs. We obtained the difficulty ratings of the bugs from the exit survey. Note that we adopt the difficulty ratings from only the Location group, since other participants' perceptions could have been influenced by the presence of generated patches. As shown in Figure 4, Rhino-192226 is considered the most difficult bug, while Rhino-217379 is considered the easiest. In particular, ANOVA with Tukey's HSD post-hoc test [20] shows that the difficulty rating of Rhino-192226 (footnote 4) is significantly higher than that of the remaining four bugs.

Figure 5c shows that, for bug3, the HighQ group made a "landslide victory," with 70% of the submitted patches being correct, while no one from the Location or the LowQ group was able to fix it correctly. One primary reason could be the complex nature of this bug, which is caused by a missing if construct (Table 2). This type of bug, also known as a code omission fault, can be hard to identify [10][14]. Fry and Weimer also empirically showed that the accuracy of human fault localization is relatively low for missing-conditional faults [14]. In addition, the fix location of this bug is not the statement that throws the exception, but five lines above it. The code itself is also complex, involving lexical parsing and dynamic compilation. Hence, both the Location and LowQ groups stumbled on this bug. However, the HighQ group indeed benefited from the generated patch, as one participant explained:

"Without deep knowledge of the data structure representing the code being optimized, the problem would have taken quite a long time to determine by manually. The patch precisely points to the problem and provided a good hint, and my debugging time was cut down to about 20 minutes."

For bug4, interestingly, the Location group performs better than the LowQ and even the HighQ group (Figure 5c). One possible explanation is that the code of this bug requires less project-specific knowledge (Table 2) and its fix

Footnote 4: For simplicity, in the remainder of the paper we refer to each bug by its order of appearance in Table 1.

Figure 4: The average difficulty rating for each bug, from 1 (very easy) to 5 (very difficult). The bugs are Bug1 (Math-280), Bug2 (Rhino-114493), Bug3 (Rhino-192226), Bug4 (Rhino-217379), and Bug5 (Rhino-76683).

by adding a null checker is a common pattern [17]. Therefore, even without the aid of generated patches, participants were likely to resolve this bug on their own, and they indeed considered it to be the easiest bug (Figure 4). For bug5, the LowQ patch deletes the entire buggy block inside the for-loop (Table 2), which might raise suspicion for being too straightforward. This possibly explains why the LowQ and Location groups have the same correctness for this bug.

In general, the HighQ group has the highest percentage of correct patches for all bugs except bug4. The LowQ group, on the other hand, has comparable or even lower correctness than the Location group. As for debugging time, participants are in general slower on the first three bugs, as shown in Figure 5d. We speculate that this also relates to the difficulty of the bugs, as these three bugs are considered more difficult than the other two (Figure 4).

3.4 Programming Experience

Research has found that expert and novice developers differ in their debugging strategies and performance [18]. In this study, we also explore whether the usefulness of generated patches as debugging aids is affected by participants' programming experience.

Among all 95 participants, 72 reported their Java programming experience. They have up to 14 and on average 4.4 years of Java programming experience. Accordingly, we divided these 72 participants into two groups: experts with above-average and novices with below-average programming experience.

As shown in Figure 5e, experts are 60% correct when aided by Location. This number increases to 76% when they are aided by HighQ patches and decreases to 45% when they

Figure 5: Percentage of correct patches and debugging time, broken down by participant type (a, b), by bug (c, d), and by programming experience (e, f), for the Location, LowQ, and HighQ debugging aids.

are aided by LowQ patches. Novices manifest a similar yet sharper trend: their percentage of correct patches increases substantially from 44% to 74% when they are aided by Location and HighQ patches, respectively. Interestingly, when aided by HighQ patches, the debugging correctness of novices is almost as good as that of the experts. Figure 5f shows that novices also spend less time debugging with HighQ patches than with Location and LowQ patches.

3.5 Regression Analysis

The preceding subsections suggest that changes in debugging correctness and time are attributable to debugging aids, bugs, participant types, programming experience, and their joint effects. To quantify the effects of these factors on debugging performance, we perform multiple regression analysis. Multiple regression analysis is a statistical process for quantifying the relations between a dependent variable and a number of independent variables [12], which fits our problem naturally. It also quantifies the statistical significance of the estimated relations [12]. This type of analysis has been applied in similar research, for example, to analyze how configurational characteristics of software teams affect project productivity [4][5][31].

Debugging aids, participant types, and bugs are all categorical variables, with 3, 3, and 5 different values, respectively. Since categorical variables cannot be used directly in regression models, we apply dummy coding to transform a categorical variable with k values into k-1 binary variables [12]. For example, for the debugging aids we create two binary variables, HighQ and LowQ, one of them being 1 if the corresponding patch is given. When both variables are 0, the Location aid is given, and it serves as the basis to which the other two categories are compared. In the same manner, we dummy-code participant types and bugs, with Grad and bug4 as the basis categories (footnote 5). The equations below form our regression models. Since patch correctness is binary data while debugging time is continuous data, we estimate Equation 1 using logistic regression and Equation 2 using linear regression [12]. Note that we log-transform debugging time to ensure its normality [12].

Footnote 5: The selection of the basis category is essentially arbitrary and does not affect the regression results [12].

\[
\mathrm{logit}(\mathit{correctness}) = \alpha_0 + \alpha_1 \cdot \mathit{HighQ} + \alpha_2 \cdot \mathit{LowQ} + \alpha_3 \cdot \mathit{Engr} + \alpha_4 \cdot \mathit{MTurk} + \alpha_5 \cdot \mathit{Bug1} + \alpha_6 \cdot \mathit{Bug2} + \alpha_7 \cdot \mathit{Bug3} + \alpha_8 \cdot \mathit{Bug5} + \alpha_9 \cdot \mathit{JavaYears} \tag{1}
\]

\[
\ln(\mathit{time}) = \beta_0 + \beta_1 \cdot \mathit{HighQ} + \beta_2 \cdot \mathit{LowQ} + \beta_3 \cdot \mathit{Engr} + \beta_4 \cdot \mathit{MTurk} + \beta_5 \cdot \mathit{Bug1} + \beta_6 \cdot \mathit{Bug2} + \beta_7 \cdot \mathit{Bug3} + \beta_8 \cdot \mathit{Bug5} + \beta_9 \cdot \mathit{JavaYears} \tag{2}
\]
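For concreteness, the dummy coding of the debugging-aid factor used in these equations can be written as follows; this simply restates the coding scheme described above rather than adding new information:

\[
(\mathit{HighQ}, \mathit{LowQ}) =
\begin{cases}
(1,0) & \text{high-quality patch given} \\
(0,1) & \text{low-quality patch given} \\
(0,0) & \text{buggy location given (baseline)}
\end{cases}
\]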

Table 4 shows the regression results, with the first number being the variable coefficient and the second number, in parentheses, its p-value. In general, a positive coefficient means the variable (row name) improves the debugging performance (column name) compared to the basis category, while a negative coefficient means the opposite. Since the logistic regression computes the log of the odds of correctness, and we log-transform debugging time in the linear regression, both computed coefficients need to be "anti-logged" to interpret the results. We highlight our findings below.
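As a worked example of this anti-logging (using the HighQ coefficients from Table 4, and restating the interpretation rule rather than adding new results):

\[
\frac{\mathrm{odds}(\text{correct} \mid \mathrm{HighQ})}{\mathrm{odds}(\text{correct} \mid \mathrm{Location})} = e^{\alpha_1} = e^{1.25} \approx 3.5,
\qquad
\frac{\widehat{\mathit{time}}_{\mathrm{HighQ}}}{\widehat{\mathit{time}}_{\mathrm{Location}}} = e^{\beta_1} = e^{-0.19} \approx 0.83,
\]

holding all other variables fixed.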

HighQ patches significantly improve debugging correctness, with coef = 1.25 > 0 and p-value = 0.00 < 0.05. To put this into perspective, holding other variables fixed, the odds of the HighQ group making correct patches are e^1.25 = 3.5 times those of the Location group. The LowQ group, on the other hand, is less likely to make correct patches, though this trend is not significant (coef = -0.55, p = 0.09). The aid of HighQ (coef = -0.19, p = 0.08) and LowQ patches (coef = -0.07, p = 0.56) both slightly reduce debugging time, but not significantly.

Difficult bugs significantly slow down debugging, as bug1, bug2, and bug3 all have significant positive coefficients, i.e., longer debugging time. For the most difficult bug, bug3, its debugging

Table 4: Regression results. Arrows indicate the direction of impact on debugging performance. Double arrows indicate that the coefficient is statistically significant at the 5% level.

             Correctness          Debugging Time
             coef (p-value)       coef (p-value)
HighQ         1.25 (0.00) ⇑       -0.19 (0.08) ↓
LowQ         -0.55 (0.09) ↓       -0.07 (0.56) ↓
Bug1         -0.57 (0.18) ↓        0.96 (0.00) ⇑
Bug2         -0.50 (0.22) ↓        0.56 (0.00) ⇑
Bug3         -2.09 (0.00) ⇓        0.59 (0.00) ⇑
Bug5         -0.59 (0.16) ↓       -0.05 (0.75) ↓
Engr          1.03 (0.02) ⇑       -1.25 (0.00) ⇓
MTurk         0.90 (0.02) ⇑       -0.16 (0.20) ↓
Java years    0.12 (0.02) ⇑       -0.03 (0.11) ↓

time is e^0.59 = 1.8 times that of bug4, yet the odds of correctly fixing it are only e^-2.09 ≈ 1/8 of the odds of correctly fixing bug4.

Participant type and experience significantly affect debugging performance. Compared to Grad, Engr and MTurk are much more likely to make correct patches (coef = 1.03 and 0.90, with p = 0.02). Engr is also significantly faster, spending approximately e^-1.25 ≈ 1/3 of the time debugging. Also, more Java programming experience significantly improves debugging correctness (coef = 0.12, p = 0.02).

Our regression analysis thus far investigates the individual impact of debugging aids and other variables on debugging performance. However, the impact of debugging aids possibly also depends on the participants who use the aids or the bugs to be fixed. To explore such joint effects, we add interaction variables to the regression model and use ANOVA to analyze their explanatory power [12]. Below are our findings.

HighQ patches are more useful for difficult bugs. As shown in Table 5, in addition to the four single variables, the debugging aids and bugs also jointly explain a large portion of the variance in patch correctness (deviance reduction = 25.54, p = 0.00). Hence, we add this interaction variable, Aid*Bug, to the logistic regression for estimating correctness. Table 6 shows that HighQ*Bug3 has a relatively large positive coefficient (coef = 18.69), indicating that the HighQ patch is particularly useful for fixing this bug.

The type of aid does not affect debugging time: none of the interaction variables significantly contributes to the model (Table 5). Interestingly, neither does the debugging aid itself, which explains only 1% (R² = 0.01) of the variance in debugging time. Instead, debugging time is much more susceptible to participant types and bugs, which further explain 20% and 16% of its variance, respectively.

These regression results are consistent with those observed in Figures 3, 4, and 5 in the preceding subsections. Due to space limitations, we elaborate on the details of the regression analysis at http://www.cse.ust.hk/~idagoo/autofix/regression_analysis.html, which reports the descriptive statistics and correlation matrix for the independent variables, the hierarchical regression results, and how Tables 4, 5, and 6 are derived.

3.6 Summary

Based on our results, we draw the following conclusions:

• High-quality generated patches significantly improve debugging correctness and are particularly beneficial

Table 5: ANOVA for the regression models with interaction variables. It shows how much variance in correctness/debugging time is explained as each (interaction) variable is added to the model. The p-value in parentheses indicates whether adding the variable significantly improves the model.

Variable           Correctness                  Debugging Time
                   (resid. deviance reduction)  (R², variance explained)
DebuggingAid       32.80 (0.00)                 0.01 (0.06)
Participant        16.85 (0.00)                 0.20 (0.00)
Bug                21.91 (0.00)                 0.16 (0.00)
JavaYears           6.13 (0.01)                 0.01 (0.11)
Aid*Participant     2.95 (0.57)                 0.02 (0.06)
Aid*Bug            25.54 (0.00)                 0.02 (0.26)
Aid*JavaYears       1.18 (0.55)                 0.00 (0.82)

Table 6: Coefficients of the interaction variable DebuggingAid*Bug for estimating patch correctness. The HighQ*Bug3 coefficient is statistically significant at the 5% level.

         Bug1    Bug2    Bug3     Bug4    Bug5
HighQ     0.42    1.86   18.69    -0.45    0.99
LowQ     -0.39   -0.24    0.12    -1.76   -0.09

for difficult bugs. Low-quality generated patches slightly undermine debugging correctness.

• Participants' debugging time is not affected by the debugging aids they use. However, their debugging takes significantly longer for difficult bugs.

• Participant type and experience also significantly affect debugging performance. Engr, MTurk, or participants with more Java programming experience are more likely to make correct patches.

4. QUALITATIVE ANALYSIS

We used the survey feedback to qualitatively investigate participants' opinions on using auto-generated patches in debugging. In total, we received 71 textual answers. After an open coding phase [20], we identified twelve common reasons why participants are positive or negative about using auto-generated patches as debugging aids. Table 7 summarizes these reasons along with the participants' original answers.

Participants from both the HighQ and LowQ groups acknowledge that generated patches provide quick starting points by pinpointing the general buggy area (P1-P3). HighQ patches in particular simplify debugging and speed up the debugging process (P4-P6). However, a quick starting point does not guarantee that the debugging is going down the right path, since generated patches can be confusing and misleading (N1-N2). Even HighQ patches may not completely solve the problem and thus need further human perfection (N3).

Participants, regardless of whether they used HighQ or LowQ patches, have a general concern that generated patches may over-complicate debugging or over-simplify it by not addressing root causes (N4-N5). In other words, participants are concerned about whether machines are making random or educated guesses when generating patches (N6). Another shared concern is that the usage of generated patches should be based on a good understanding of the target programs (N7-N8).

Table 7: Participants' positive and negative opinions on using auto-generated patches as debugging aids. All of these sentences are the participants' original feedback. We organize them into categories, with the category summary in italics.

Positive: It provides a quick starting point.
  P1: "It usually provides a good starting point." (HighQ)
  P2: "It did point to the general area of code." (LowQ)
  P3: "I have followed the patch to identify the exact place where the issue occurred." (LowQ)
Negative: It can be confusing, misleading, or incomplete.
  N1: "The generated patch was confusing." (LowQ)
  N2: "For developers with less experience, it can give them the wrong lead." (LowQ)
  N3: "It's not a complete solution. The patch tested for two null references; however, one of the two variables was dereferenced about 1-2 lines above the faulting line. Therefore, there should have been two null reference tests." (HighQ)

Positive: It simplifies the problem.
  P4: "I feel that the hints were quite good and simplified the problem." (HighQ)
  P5: "The hint provided the solution to the problem that would have taken quite a long time to determine by manually." (HighQ)
  P6: "This saves time when it comes to bugs and memory monitoring." (HighQ)
Negative: It may over-complicate the problem or over-simplify it by not addressing the root cause.
  N4: "Sometimes they might over-complicate the problem or not address the root of the issue." (HighQ)
  N5: "The patch did not appear to catch the issue and instead looked like it deleted the block of code." (LowQ)
  N6: "Generated patches are only useful when the machine does not do it randomly, e.g., it cannot just guess that something may fix it, and if it does that, that is good." (LowQ)

Positive: It helps brainstorming.
  P7: "They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn't always correct on its own." (LowQ)
Negative: It may not be helpful for unfamiliar code.
  N7: "I prefer to fix the bug manually with a deeper understanding of the program." (HighQ)
  N8: "A context to the problem may have helped here." (LowQ)

Positive: It recognizes easy problems.
  P8: "They would be good at recognizing obvious problems (e.g., potential NPE)." (HighQ)
Negative: It may not recognize complicated problems.
  N9: "They would be difficult to recognize more involved defects." (HighQ)

Positive: It makes tests pass.
  P9: "The patch made the test case pass." (LowQ)
Negative: It may not work if the test suite itself is insufficient.
  N10: "Assuming the test cases are sufficient, the patch will work." (HighQ)
  N11: "I'm worried that other test cases will fail." (LowQ)

Positive: It provides extra diagnostic information.
  P10: "Any additional information regarding a bug is almost always helpful." (HighQ)
Negative: It cannot replace standard debugging tools.
  N12: "I would use them as debugging aids along with access to standard debugging tools." (LowQ)

Interestingly, we also observe several opposing opinions. Participants think that generated patches may be good at recognizing easy bugs (P8) but may not work well for complicated ones (N9). While generated patches can quickly pass test cases (P9), they work only if the test cases are sufficient (N10-N11). Finally, participants use generated patches to brainstorm (P7) and to acquire additional diagnostic information (P10). However, they still consider generated patches dispensable, especially when powerful debugging tools are available (N12).

In general, participants are positive about auto-generated patches, as they provide quick hints about buggy areas or even fix solutions. On the other hand, participants are less confident about the capability of generated patches to handle complicated bugs and address root causes, especially when they are unfamiliar with the code and test cases.

5. THREATS TO VALIDITY

The five bugs and ten generated patches we used to construct the debugging tasks may not be representative of all bugs and generated patches. We mitigate this threat by selecting real bugs with varied symptoms and defect types, and by using patches that were generated by two state-of-the-art program repair techniques [17][34]. Nonetheless, further studies with wider coverage of bugs and auto-generated patches are required to strengthen the generalizability of our findings.

We measured patch quality using human-perceived acceptability, which may not generalize to other measurements, such as metric-based ones [27]. However, patch quality is a rather broad concept, and research has suggested that there is no easy way to definitively describe it [13][24]. Another threat regarding patch quality is that we labeled a generated patch as "LowQ" relative to its competing "HighQ" patch in this study. We plan to work on a more general solution by establishing a quality baseline, for example, the buggy location as in the control group, and distinguishing the quality of auto-generated patches against this baseline.

To avoid repeated testing [20], we did not invite domain experts who were familiar with the buggy code, since it would be hard to judge whether their good performance is attributable to the debugging aids or to their a priori knowledge of the corresponding buggy code. Instead, we recruited participants who had little or no domain knowledge of the code. This is not an unrealistic setting, since in practice developers are often required to work with unfamiliar code due to limited human resources or deadline pressure [9][19][25]. However, our findings may not generalize to project developers who are sufficiently knowledgeable about the code. Future studies are required to explore how developers make use of auto-generated patches to debug code they are familiar with.

Our manual evaluation of the participants' patches might be biased. We mitigate this threat by considering patches verified by the original project developers as the ground truth. Three human evaluators individually evaluated the patches over two iterations and reached a consensus, with a Fleiss' kappa of 0.956 (Section 2.4).

Our measurement of debugging time might be inaccurate for Engr and MTurk participants, since the online system cannot remotely monitor participants when they are away from the keyboard during debugging. To mitigate this threat, we asked these participants to self-report the time they had spent on each task and found no significant difference between their estimated time and our recorded time.

One may argue that participants might have blindly reused the auto-generated patches instead of truly fixing the bugs on their own. We took several preventive measures to discourage such behavior. First, we emphasized in the instructions that the provided patches may or may not really fix the bugs, and that participants should make their own judgment. Second, we claimed to withhold additional test cases of our own, so that participants could not relax even if they passed all the given tests by reusing the generated patches. We also required participants to justify their patches during submission. As discussed in Section 3, the participants aided by generated patches spent a similar amount of debugging time to the Location group. This indicates that participants were unlikely to reuse generated patches directly, which would otherwise take only seconds to complete.

6. RELATED WORK

Automatic patch generation techniques have been actively studied. Most studies evaluated the quantitative aspects of the proposed techniques, such as the number of candidates generated before a valid patch is found and the number of subjects for which patches can be generated [8][21][26][30][33][34]. Our study complements prior work by qualitatively evaluating the usefulness of auto-generated patches as debugging aids.

Several studies have discussed the quality of generated patches. Kim et al. evaluated generated patches using human acceptability [17], which is adopted in our study to indicate patch quality. Fry et al. considered patch quality in terms of understandability and maintainability [13]. Their human study revealed that generated patches alone are less maintainable than human-written patches. However, when augmented with synthesized documentation, generated patches have comparable or even better maintainability than human-written patches [13]. Fry et al. also investigated code features (e.g., the number of conditionals) that correlate with patch maintainability [13].

While patch quality can be measured from various perspectives, research has suggested that, instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help in performing difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, while generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behaviors. In fact, contrary to the findings of Ceccato et al., we observed that the participants' debugging performance dropped when they were given low-quality generated patches.

7. CONCLUSION

We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discovered that participants do fix bugs more quickly and correctly when using high-quality generated patches. However, when aided by low-quality generated patches, participants' debugging correctness is compromised.

Based on the study results, we have learned important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required, even if they are used as debugging aids instead of being directly integrated into the production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize the overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs; for easy bugs, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from using generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable about the buggy code.

8. ACKNOWLEDGMENTS

We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9. REFERENCES

[1] Amazon Mechanical Turk. https://www.mturk.com/mturk/.
[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.
[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE'09.
[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE'09.
[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE'11, 2011.
[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezzè. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE'13.
[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE'12, pages 452-462.
[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE'09.
[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: An exploratory study. ICSE'12, pages 266-276.
[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. IEEE Transactions on Software Engineering, 32(11):849-867, 2006.
[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[12] J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage, 1997.
[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA'12.
[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM'10.
[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE'10.
[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.
[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE'13.
[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC'03.
[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971-987, 2006.
[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.
[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. ICSE'12.
[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1-23, 2013.
[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241-256, 2006.
[24] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE'14.
[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: A controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT'06/FSE-14, pages 12-22, 2006.
[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: Program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE'13.
[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks: An empirical study with industry practitioners. In ICSM'11.
[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA'11.
[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP'09.
[30] Y. Qi, X. Mao, Y. Lei, and C. Wang. Using automated program repair for evaluating the effectiveness of fault localization techniques. ISSTA'13, 2013.
[31] N. Ramasubbu, M. Cataldo, R. K. Balan, and J. D. Herbsleb. Configuring global software teams: A multi-company analysis of project productivity, quality, and profits. ICSE'11, pages 261-270, 2011.
[32] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. CHI EA'10.
[33] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller. Automated fixing of programs with contracts. ISSTA'10.
[34] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE'09.

  • Introduction
  • Experimental Process
    • Bug Selection and Debugging Aids
    • Participants
    • Study Design
    • Evaluation Measures
      • Results
        • Debugging Aids
        • Participant Types
        • Bug Difficulty
        • Programming Experience
        • Regression Analysis
        • Summary
          • Qualitative Analysis
          • Threats to Validity
          • Related Work
          • Conclusion
          • Acknowledgments
          • References
Page 4: Automatically Generated Patches as Debugging Aids: A Human ...people.csail.mit.edu/hunkim/papers/tao-fse2014.pdf · Debugging is of paramount importance in the process of software

Table 2 The two generated patches for each bug The middle column shows the patchesrsquo acceptability ranking reported byKim et al [17] The patch with a higher ranking (or lower bar) is labeled as high-quality and its counterpart for the same bugis labeled as low-quality

Bug High-quality patch Rank Low-quality patch

Math

-280 public static double [] bracket (

double lowerBound double upperBound )if (fa fb gt 00 || fa fb gt= 00 )

throw new ConvergenceException(

0

1

2

3

High-Q

Low-Q

public static double [] bracket (double lowerBound double upperBound )if (function == null )

throw MathRuntimeExceptioncreateIllegalArgumentException(

Rhin

o-1

14493 if (lhs == undefined)

if(getShort(iCode pc + 1) lt stringslength ampampgetShort(iCode pc + 1) gt=0) lhs = strings[getShort(iCode pc +

1)]

0

1

2

3

High-Q

Low-Q

if (lhs == undefined) lhs = (( Scriptable)lhs)getDefaultValue

(null)

Rhin

o-1

92226

private void visitRegularCall(Node node inttype Node child boolean firstArgDone)

String simpleCallName = nullif (type = TokenStreamNEW)

if ( firstArgDone)simpleCallName =

getSimpleCallName(node) 0

1

2

3

High-Q

Low-Q

private void visitRegularCall(Node node inttype Node child boolean firstArgDone)

String simpleCallName = nullvisitStatement(node)

Rhin

o-2

17379 for (int i = 0 i lt parenCount i++)

SubString sub =(SubString) parensget(i)

if (sub = null) args[i+1] = subtoString ()

0

1

2

3

High-Q

Low-Q

for (int i = 0 i lt parenCount i++) SubString sub =

(SubString) parensget(i)args[parenCount + 1] = new Integer(reImpl

leftContextlength)

Rhin

o-7

6683 for (int i = num i lt stateparenCount i++)

if( state = null ampampstateparens[i] = null)

stateparens[i] length = 0

stateparenCount = num

0

1

2

3

High-Q

Low-Q

for (int i = num i lt stateparenCount i++)

stateparenCount = num

HighQ groups each had 15 participants and the Locationgroup had 14 participants

The graduate students used the Eclipse IDE (Figure 2a)We piloted the study with 9 CS undergraduate students andrecorded their entire Eclipse usage with screen-capturingsoftware From this pilot study we determined that twohours should be adequate for participants to complete allfive tasks at a reasonably comfortable pace The formalstudy session was supervised by one of the authors and twoother helpers At the beginning of the session we gave a10-minute tutorial introducing the tasks Then within amaximum of two hours participants completed the five de-bugging tasks in their preferred order

Online setting applies to the software engineers andMTurk participants who are geographically unavailable Wecreated a web-based online debugging system3 that pro-vided similar features to the Eclipse workbench (Figure 2b)Unlike Grad it was unlikely to determine beforehand thetotal number of these online participants and their exper-tise Hence to minimize individual differences we adopted

3httppishoncseusthkuserstudy

a within-group design such that participants can be exposedto different debugging aids [20] To balance the experimen-tal conditions we assigned the type of aids to each selectedbug in a round-robin fashion such that each aid was equallylikely to be given to each bug

We piloted the study with 4 CS graduate students Theywere asked to use our debugging system and report any en-countered problems We resolved these problems such asbrowser incompatibility before formally inviting softwareengineers and MTurk users

An exit survey is administered to all the participantsupon their task completion In the survey we asked theparticipants to rate the difficulty of each bug and the helpful-ness of the provided aids on a 5-point Likert scale Also weasked them to share their opinions on using auto-generatedpatches in a free-textual form We asked Engr and MTurkparticipants to self-report their Java programming experi-ence and debugging time for each task since the online sys-tem cannot monitor the event when they were away fromthe keyboard (discussed in Section 5) In addition we askedMTurk participants to report their occupations

We summarize our experimental settings in Table 3

A B C

D

E

(a) The Eclipse workbench

D

B

C

E

A

(b) The online debugging system

Figure 2 The onsite and online debugging environmentswhich allow participants to browse a bugrsquos description (A)edit the buggy source code (B) view the debugging aids (C)run the test cases (D) and submit their patches (E)

24 Evaluation MeasuresWe measure the participantsrsquo patch correctness debug-

ging time and survey feedback as described belowCorrectness indicates whether a patch submitted by a

participant is correct or not We considered a patch correctonly if it passed all the given test cases and also matchedthe semantics of the original developersrsquo fixes as described inTable 1 Two of the authors and one CS graduate studentwith 8 years of Java programming experience individuallyevaluated all patches Initially the three evaluators agreedon 283 of 337 patches The disagreement mainly resultedfrom misunderstanding of patch semantics After clarifica-tion the three evaluators carried out another round of in-dividual evaluations and this time agreed on 326 out of 337patches This was considered to be near-perfect agreementas its Fleiss kappa value equals to 0956 [11] For the re-maining 11 patches the majorityrsquos opinion (ie two of thethree evaluators) was adopted as the final evaluation result

Debugging time was recorded automatically For theonsite session we created an Eclipse plug-in to sum up theelapsed time of all the activities (eg opening a file mod-ifying the code and running the tests) related to each bugFor online sessions our system computes participantsrsquo de-bugging time as the time elapsed from them entering thewebsite until the patch submission

Feedback was collected from the exit survey (Section 23)We particularly analyzed participantsrsquo perception about thehelpfulness of generated patches and their free-form answersThese two items allowed us to quantitatively and quali-tatively evaluate participantsrsquo opinions on using generatedpatches as debugging aids (Section 31 and Section 4)

Table 3 The study settings and designs for different typesof participants along with their population size and numberof submitted patches In cases where participants skippedor gave up some tasks we excluded patches with no modi-fication to the original buggy code and finally collected 337patches for analysis

Settings Designs Size Pa-tches

Grad Onsite (Eclipse) Between-group 44 216Engr Online (website) Within-group 28 68MTurk Online (website) Within-group 23 53

Total 95 337

3 RESULTSWe report participantsrsquo debugging performance with re-

spect to different debugging aids (Section 31) We theninvestigate debugging performance on different participanttypes (Section 32) bug difficulties (Section 33) and pro-gramming experience (Section 34) We use regression mod-els to statistically analyze the relations between these fac-tors and the debugging performance (Section 35) and finallysummarize our observations (Section 36)

31 Debugging AidsAmong the 337 patches we collected 109 112 and 116 were

aided by buggy locations LowQ and HighQ patches respec-tively Figure 3 shows an overview of the results in terms ofthe three evaluation measures

Correctness (Figure 3a) Patches submitted by the Lo-cation group is 48 (52109) correct The percentage ofcorrect patches dramatically increases to 71 (82116) forthe HighQ group LowQ patches do not improve debug-ging correctness On the contrary the LowQ group performseven worse than the Location group with only 33 (37112)patches being correct

Debugging Time (Figure 3b) The HighQ group has anaverage debugging time of 147 minutes which is slightlyfaster than the 173 and 163 minutes of the Location andLowQ groups

Feedback (Figure 3c) HighQ patches are consideredmuch more helpful for debugging compared to LowQ patchesMann-Whitney U test suggests that this difference is statis-tically significant (p-value = 00006 lt 005) To understandthe reasons behind this result we further analyze partici-pantsrsquo textual answers and discuss the details in Section 4

32 Participant TypesOur three types of participants Engr Grad and MTurk

have different demographics and are exposed to different ex-perimental settings For this reason we investigate the de-bugging performance by participant types

Figure 5a shows a consistent trend across all three partic-ipant types the HighQ group makes the highest percentageof correct patches followed by the Location group while theLowQ group makes the lowest Note that Grad has a lowercorrectness compared to Engr and MTurk possibly due totheir relatively short programming experience

Figure 5b reveals that Engr debugs faster than Grad andMTurk participants Yet among the participants from thesame population their debugging time does not vary toomuch between the Location LowQ and HighQ groups

0

20

40

60

Location LowQ HighQ(a)

o

f co

rrec

t p

atch

es

0

20

40

60

Location LowQ HighQ(b)

Deb

ug

gin

g t

ime

(min

)

1(Not helpful)

2(Slightly helpful)

3(Medium)

4(Helpful)

5(Very helpful)

LowQ HighQ(c)

Figure 3 (a) the percentage of correct patches by different debugging aids (b) debugging time by different aids (c) partici-pantsrsquo perception about the helpfulness of LowQ and HighQ patches

33 Bug DifficultySince the five bugs vary in symptoms and defect types

they possibly pose different levels of challenges to the partic-ipants Research also shows that task difficulty potentiallycorrelates to usersrsquo perception about automated diagnosticaids [23] For this reason we investigate participantsrsquo de-bugging performance with respect to different bugs Weobtained the difficulty ratings of bugs from the exit sur-vey Note that we adopt the difficulty ratings from only theLocation group since other participantsrsquo perceptions couldhave been influenced by the presence of generated patchesAs shown in Figure 4 Rhino-192226 is considered to be themost difficult bug while Rhino-217379 is considered to be theeasiest In particular ANOVA with Tukey HSD post-hoctest [20] shows that the difficulty rating of Rhino-1922264 issignificantly higher than that of the remaining four bugs

Figure 5c shows that for bug3 the HighQ group made aldquolandslide victoryrdquo with 70 of the submitted patches beingcorrect while no one from the Location or the LowQ groupwas able to fix it correctly One primary reason could bethe complex nature of this bug which is caused by a miss-ing if construct (Table 2) This type of bug also knownas the code omission fault can be hard to identify [10][14]Fry and Weimer also empirically showed that the accuracyof human fault localization is relatively low for missing-conditional faults [14] In addition the fix location of thisbug is not the statement that throws the exception but fivelines above it The code itself is also complex for involvinglexical parsing and dynamic compilation Hence both theLocation and LowQ groups stumbled on this bug Howeverthe HighQ group indeed benefited from the generated patchas one participant explained

ldquoWithout deep knowledge of the data structure represent-ing the code being optimized the problem would have takenquite a long time to determine by manually The patchprecisely points to the problem and provided a good hintand my debugging time was cut down to about 20 min-utesrdquo

For bug4, interestingly, the Location group performs better than the LowQ and even the HighQ group (Figure 5c). One possible explanation is that the code of this bug requires less project-specific knowledge (Table 2), and its fix, adding a null checker, is a common pattern [17].

4 For simplicity, in the remainder of the paper we refer to each bug by its order of appearance in Table 1.

[Figure 4: The average difficulty rating for each bug, on a scale from 1 (Very easy) to 5 (Very difficult). Bug1 = Math-280, Bug2 = Rhino-114493, Bug3 = Rhino-192226, Bug4 = Rhino-217379, Bug5 = Rhino-76683.]

Therefore, even without the aid of generated patches, participants were likely to resolve this bug by themselves, and they indeed considered it to be the easiest bug (Figure 4). For bug5, the LowQ patch deletes the entire buggy block inside the for-loop (Table 2), which might raise suspicion for being too straightforward. This possibly explains why the LowQ and Location groups have the same correctness for this bug.

In general, the HighQ group has made the highest percentage of correct patches for all bugs except bug4. The LowQ group, on the other hand, has comparable or even lower correctness than the Location group. As for debugging time, participants are in general slower for the first three bugs, as shown in Figure 5d. We speculate that this also relates to the difficulty of the bugs, as these three bugs are considered to be more difficult than the other two (Figure 4).

3.4 Programming Experience

Research has found that expert and novice developers differ in terms of debugging strategies and performance [18]. In this study, we also explore whether the usefulness of generated patches as debugging aids is affected by participants' programming experience.

Among all 95 participants, 72 reported their Java programming experience. They have up to 14 and on average 4.4 years of Java programming experience. Accordingly, we divided these 72 participants into two groups: experts with above-average and novices with below-average programming experience.
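A small pandas sketch of this split is shown below; the participant data and column names are hypothetical stand-ins for the survey data.

```python
# Minimal sketch: split participants into experts/novices by mean Java experience.
# The DataFrame and its values are hypothetical placeholders.
import pandas as pd

participants = pd.DataFrame({
    "id":         [1, 2, 3, 4, 5],
    "java_years": [2, 6, 4, 10, 1],   # hypothetical self-reported experience
})

mean_exp = participants["java_years"].mean()   # 4.4 years in the actual study
participants["group"] = participants["java_years"].apply(
    lambda y: "expert" if y > mean_exp else "novice")
print(participants)
```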

As shown in Figure 5e, experts are 60% correct when aided by Location. This number increases to 76% when they are aided by HighQ patches and decreases to 45% when they are aided by LowQ patches.

[Figure 5: Percentage of correct patches and debugging time by participant type, bug, and programming experience. Panels (a) and (b) break down by Engr, Grad, and MTurk; panels (c) and (d) by Bug1-Bug5; panels (e) and (f) by experts and novices. Bars are grouped by debugging aid (Location, LowQ, HighQ).]

Novices manifest a similar yet sharper trend: their percentage of correct patches increases substantially from 44% to 74% when they are aided by Location and HighQ patches, respectively. Interestingly, when aided by HighQ patches, the debugging correctness of novices is almost as good as that of the experts. Figure 5f shows that novices also spend less time debugging with HighQ patches than with Location and LowQ patches.

3.5 Regression Analysis

The prior subsections observe that changes in debugging correctness and time are attributable to debugging aids, bugs, participant types, programming experience, and their joint effects. To quantify the effects of these factors on debugging performance, we perform multiple regression analysis. Multiple regression analysis is a statistical process for quantifying relations between a dependent variable and a number of independent variables [12], which fits naturally to our problem. Multiple regression analysis also quantifies the statistical significance of the estimated relations [12]. This type of analysis has been applied in other similar research, for example, to analyze how configurational characteristics of software teams affect project productivity [4][5][31].

Debugging aids, participant types, and bugs are all categorical variables, with 3, 3, and 5 different values, respectively. Since categorical variables cannot be directly used in regression models, we apply dummy coding to transform a categorical variable with k values into k−1 binary variables [12]. For example, for the debugging aids we create two binary variables, HighQ and LowQ, one of them being 1 if the corresponding patch is given. When both variables are 0, the Location aid is given, and it serves as the basis to which the other two categories are compared. In the same manner, we dummy-code participant types and bugs, with "Grad" and "bug4" as the basis categories. Below are the equations that form our regression models.

5 The selection of the basis category is essentially arbitrary and does not affect the regression results [12].

Since patch correctness is binary data while debugging time is continuous data, we estimate Equation 1 using logistic regression and Equation 2 using linear regression [12]. Note that we log-transform debugging time to ensure its normality [12].

logit(correctness) = α0 + α1·HighQ + α2·LowQ + α3·Engr + α4·MTurk + α5·Bug1 + α6·Bug2 + α7·Bug3 + α8·Bug5 + α9·JavaYears    (1)

ln(time) = β0 + β1·HighQ + β2·LowQ + β3·Engr + β4·MTurk + β5·Bug1 + β6·Bug2 + β7·Bug3 + β8·Bug5 + β9·JavaYears    (2)
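The sketch below shows one way Equations 1 and 2 could be fitted with statsmodels. The data frame, its column names, and the randomly generated values are hypothetical; formula notation with Treatment contrasts reproduces the dummy coding described above, with Location, Grad, and bug4 as the basis categories.

```python
# Minimal sketch: logistic regression for correctness (Eq. 1) and a log-linear
# model for debugging time (Eq. 2). All data below are randomly generated
# placeholders so that the example runs; they are not the study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "aid":        rng.choice(["Location", "LowQ", "HighQ"], n),
    "ptype":      rng.choice(["Grad", "Engr", "MTurk"], n),
    "bug":        rng.choice(["bug1", "bug2", "bug3", "bug4", "bug5"], n),
    "java_years": rng.integers(0, 15, n),
    "correct":    rng.integers(0, 2, n),        # patch correctness (0/1)
    "time_min":   rng.uniform(5, 60, n),        # debugging time in minutes
})

# C(x, Treatment(reference=...)) applies the dummy coding described in the text:
# Location, Grad, and bug4 serve as the basis categories.
rhs = ("C(aid, Treatment(reference='Location')) "
       "+ C(ptype, Treatment(reference='Grad')) "
       "+ C(bug, Treatment(reference='bug4')) + java_years")

correctness_model = smf.logit("correct ~ " + rhs, data=df).fit(disp=False)   # Eq. 1
time_model        = smf.ols("np.log(time_min) ~ " + rhs, data=df).fit()      # Eq. 2

print(correctness_model.params)
print(time_model.params)
```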

Table 4 shows the regression results, with the first number being the variable coefficient and the second number, in parentheses, its p-value. In general, a positive coefficient means the variable (row name) improves the debugging performance (column name) compared to the basis category, while a negative coefficient means the opposite. Since the logistic regression computes the log of odds for correctness, and we log-transform debugging time in the linear regression, both computed coefficients need to be "anti-logged" to interpret the results. We highlight our findings below.

HighQ patches significantly improve debugging correctness, with coef = 1.25 > 0 and p-value = 0.00 < 0.05. To put this into perspective, holding other variables fixed, the odds of the HighQ group making correct patches are e^1.25 = 3.5 times those of the Location group. The LowQ group, on the other hand, is less likely to make correct patches, though this trend is not significant (coef = -0.55, p = 0.09). The aid of HighQ (coef = -0.19, p = 0.08) and LowQ patches (coef = -0.07, p = 0.56) both slightly reduce the debugging time, but not significantly.
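To make the "anti-logging" step concrete, the worked example below (a sketch using the HighQ correctness coefficient and, anticipating the participant-type finding later in this subsection, the Engr time coefficient from Table 4) shows how the exponentiated coefficients are read:

```latex
% Logistic model: an exponentiated coefficient is an odds ratio.
\[
  \frac{\mathrm{odds}(\text{correct} \mid \text{HighQ})}{\mathrm{odds}(\text{correct} \mid \text{Location})}
  = e^{\alpha_1} = e^{1.25} \approx 3.5
\]
% Log-linear time model: an exponentiated coefficient is a multiplicative
% factor on debugging time (example: Engr vs. Grad, other variables fixed).
\[
  \frac{\widehat{\mathrm{time}}_{\text{Engr}}}{\widehat{\mathrm{time}}_{\text{Grad}}}
  = e^{\beta_3} = e^{-1.25} \approx 0.29 \approx \tfrac{1}{3}
\]
```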

Difficult bugs significantly slow down debugging, as bug1, bug2, and bug3 all have significant positive coefficients, i.e., longer debugging time.

Table 4: Regression results. Arrows indicate the direction of impact on debugging performance. Double arrows indicate that the coefficient is statistically significant at the 5% level.

             Correctness        Debugging Time
             coef (p-value)     coef (p-value)
HighQ         1.25 (0.00) ⇑     -0.19 (0.08) ↓
LowQ         -0.55 (0.09) ↓     -0.07 (0.56) ↓
Bug1         -0.57 (0.18) ↓      0.96 (0.00) ⇑
Bug2         -0.50 (0.22) ↓      0.56 (0.00) ⇑
Bug3         -2.09 (0.00) ⇓      0.59 (0.00) ⇑
Bug5         -0.59 (0.16) ↓     -0.05 (0.75) ↓
Engr          1.03 (0.02) ⇑     -1.25 (0.00) ⇓
MTurk         0.90 (0.02) ⇑     -0.16 (0.20) ↓
Java years    0.12 (0.02) ⇑     -0.03 (0.11) ↓

For the most difficult bug, bug3, the debugging time is e^0.59 = 1.8 times that of bug4, yet the odds of correctly fixing it are only e^-2.09 ≈ 1/8 of the odds of correctly fixing bug4.

Participant type and experience significantly affect debugging performance. Compared to Grad, Engr and MTurk are much more likely to make correct patches (coef = 1.03 and 0.90, with p = 0.02). Engr is also significantly faster, spending approximately e^-1.25 ≈ 1/3 of the time debugging. Also, more Java programming experience significantly improves debugging correctness (coef = 0.12, p = 0.02).

Our regression analysis thus far investigates the individual impact of debugging aids and other variables on debugging performance. However, the impact of debugging aids also possibly depends on the participants who use the aids, or on the bugs to be fixed. To explore such joint effects, we add interaction variables to the regression model and use ANOVA to analyze their explanatory power [12]. Below are our findings.
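A minimal sketch of this kind of interaction analysis is shown below: it fits the correctness model with and without an Aid×Bug interaction and measures the deviance reduction with a likelihood-ratio test. The data are randomly generated placeholders; only the procedure mirrors the analysis described above, and the choice of reference level does not affect this comparison.

```python
# Minimal sketch: test whether the Aid x Bug interaction adds explanatory power
# to the correctness model, via the deviance reduction between nested models.
# All data below are randomly generated placeholders, not the study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "aid":        rng.choice(["Location", "LowQ", "HighQ"], n),
    "ptype":      rng.choice(["Grad", "Engr", "MTurk"], n),
    "bug":        rng.choice(["bug1", "bug2", "bug3", "bug4", "bug5"], n),
    "java_years": rng.integers(0, 15, n),
    "correct":    rng.integers(0, 2, n),
})

base = smf.logit("correct ~ C(aid) + C(ptype) + C(bug) + java_years",
                 data=df).fit(disp=False)
full = smf.logit("correct ~ C(aid) + C(ptype) + C(bug) + java_years + C(aid):C(bug)",
                 data=df).fit(disp=False)

deviance_reduction = 2 * (full.llf - base.llf)   # drop in residual deviance
df_diff = full.df_model - base.df_model          # extra interaction parameters
p_value = chi2.sf(deviance_reduction, df_diff)
print(f"deviance reduction = {deviance_reduction:.2f}, p = {p_value:.3f}")
```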

HighQ patches are more useful for difficult bugs. As shown in Table 5, in addition to the four single variables, the debugging aids and bugs also jointly explain a large variance in patch correctness (deviance reduction = 25.54, p = 0.00). Hence, we add this interaction variable, Aid*Bug, to the logistic regression for estimating correctness. Table 6 shows that HighQ*Bug3 has a relatively large positive coefficient (coef = 18.69), indicating that the HighQ patch is particularly useful for fixing this bug.

The type of aid does not affect debugging time, as none of the interaction variables significantly contributes to the model (Table 5). Interestingly, neither does the debugging aid itself, which explains only 1% (R² = 0.01) of the variance in debugging time. Instead, debugging time is much more susceptible to participant types and bugs, which further explain 20% and 16% of its variance, respectively.

These regression results are consistent with those observed from Figures 3, 4, and 5 in the preceding subsections. Due to space limitations, we elaborate more details of the regression analysis at http://www.cse.ust.hk/~idagoo/autofix/regression-analysis.html, which reports the descriptive statistics and correlation matrix for the independent variables, the hierarchical regression results, and how Tables 4, 5, and 6 are derived.

Table 5: ANOVA for the regression models with interaction variables. It shows how much variance in correctness/debugging time is explained as each (interaction) variable is added to the model. The p-value in parentheses indicates whether adding the variable significantly improves the model.

Variable           Correctness                     Debugging Time
                   (resid. deviance reduction)     (R², variance explained)
DebuggingAid       32.80 (0.00)                    0.01 (0.06)
Participant        16.85 (0.00)                    0.20 (0.00)
Bug                21.91 (0.00)                    0.16 (0.00)
JavaYears           6.13 (0.01)                    0.01 (0.11)
Aid*Participant     2.95 (0.57)                    0.02 (0.06)
Aid*Bug            25.54 (0.00)                    0.02 (0.26)
Aid*JavaYears       1.18 (0.55)                    0.00 (0.82)

Table 6: Coefficients of the interaction variable DebugAid*Bug for estimating patch correctness. The starred value is statistically significant at the 5% level.

           Bug1     Bug2     Bug3     Bug4     Bug5
HighQ      0.42     1.86    18.69*   -0.45     0.99
LowQ      -0.39    -0.24     0.12    -1.76    -0.09

3.6 Summary

Based on our results, we draw the following conclusions:

• High-quality generated patches significantly improve debugging correctness and are particularly beneficial for difficult bugs. Low-quality generated patches slightly undermine debugging correctness.

• Participants' debugging time is not affected by the debugging aids they use. However, their debugging takes significantly longer for difficult bugs.

• Participant type and experience also significantly affect debugging performance. Engr, MTurk, and participants with more Java programming experience are more likely to make correct patches.

4. QUALITATIVE ANALYSIS

We used the survey feedback to qualitatively investigate participants' opinions on using auto-generated patches in debugging. In total, we received 71 textual answers. After an open coding phase [20], we identified twelve common reasons why participants are positive or negative about using auto-generated patches as debugging aids. Table 7 summarizes these reasons along with the participants' original answers.

Participants from both the HighQ and LowQ groups acknowledge that generated patches provide quick starting points by pinpointing the general buggy area (P1-P3). HighQ patches in particular simplify debugging and speed up the debugging process (P4-P6). However, a quick starting point does not guarantee that the debugging is going down the right path, since generated patches can be confusing and misleading (N1-N2). Even HighQ patches may not completely solve the problem and thus need further refinement by humans (N3).

Participants, regardless of using HighQ or LowQ patches, have a general concern that generated patches may over-complicate debugging or over-simplify it by not addressing root causes (N4-N5). In other words, participants are concerned about whether machines are making random or educated guesses when generating patches (N6). Another shared concern is that the use of generated patches should be based on a good understanding of the target programs (N7-N8).

Table 7: Participants' positive and negative opinions on using auto-generated patches as debugging aids. All of these sentences are the participants' original feedback. We organize them into categories, pairing each positive summary with the corresponding negative one and listing the quotes underneath.

Positive: It provides a quick starting point.
  P1: "It usually provides a good starting point." (HighQ)
  P2: "It did point to the general area of code." (LowQ)
  P3: "I have followed the patch to identify the exact place where the issue occurred." (LowQ)
Negative: It can be confusing, misleading, or incomplete.
  N1: "The generated patch was confusing." (LowQ)
  N2: "For developers with less experience, it can give them the wrong lead." (LowQ)
  N3: "It's not a complete solution. The patch tested for two null references; however, one of the two variables was dereferenced about 1-2 lines above the faulting line. Therefore, there should have been two null reference tests." (HighQ)

Positive: It simplifies the problem.
  P4: "I feel that the hints were quite good and simplified the problem." (HighQ)
  P5: "The hint provided the solution to the problem that would have taken quite a long time to determine by manually." (HighQ)
  P6: "This saves time when it comes to bugs and memory monitoring." (HighQ)
Negative: It may over-complicate the problem or over-simplify it by not addressing the root cause.
  N4: "Sometimes they might over-complicate the problem or not address the root of the issue." (HighQ)
  N5: "The patch did not appear to catch the issue and instead looked like it deleted the block of code." (LowQ)
  N6: "Generated patches are only useful when the machine does not do it randomly, e.g., it cannot just guess that something may fix it, and if it does that, that is good." (LowQ)

Positive: It helps brainstorming.
  P7: "They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn't always correct on its own." (LowQ)
Negative: It may not be helpful for unfamiliar code.
  N7: "I prefer to fix the bug manually with a deeper understanding of the program." (HighQ)
  N8: "A context to the problem may have helped here." (LowQ)

Positive: It recognizes easy problems.
  P8: "They would be good at recognizing obvious problems (e.g., potential NPE)." (HighQ)
Negative: It may not recognize complicated problems.
  N9: "They would be difficult to recognize more involved defects." (HighQ)

Positive: It makes tests pass.
  P9: "The patch made the test case pass." (LowQ)
Negative: It may not work if the test suite itself is insufficient.
  N10: "Assuming the test cases are sufficient, the patch will work." (HighQ)
  N11: "I'm worried that other test cases will fail." (LowQ)

Positive: It provides extra diagnostic information.
  P10: "Any additional information regarding a bug is almost always helpful." (HighQ)
Negative: It cannot replace standard debugging tools.
  N12: "I would use them as debugging aids, along with access to standard debugging tools." (LowQ)

Interestingly, we also observe several opposite opinions. Participants think that generated patches may be good at recognizing easy bugs (P8) but may not work well for complicated ones (N9). While generated patches can quickly pass test cases (P9), they work only if the test cases are sufficient (N10-N11). Finally, participants use generated patches to brainstorm (P7) and to acquire additional diagnostic information (P10). However, they still consider generated patches dispensable, especially when powerful debugging tools are available (N12).

In general, participants are positive about auto-generated patches, as they provide quick hints about buggy areas or even fix solutions. On the other hand, participants are less confident about the capability of generated patches to handle complicated bugs and address root causes, especially when they are unfamiliar with the code and test cases.

5. THREATS TO VALIDITY

The five bugs and ten generated patches we used to construct the debugging tasks may not be representative of all bugs and generated patches. We mitigate this threat by selecting real bugs with varied symptoms and defect types and by using patches that were generated by two state-of-the-art program repair techniques [17][34].

Nonetheless, further studies with wider coverage of bugs and auto-generated patches are required to strengthen the generalizability of our findings.

We measured patch quality using human-perceived acceptability, which may not generalize to other measurements, such as metric-based ones [27]. However, patch quality is a rather broad concept, and research has suggested that there is no easy way to definitively describe patch quality [13][24]. Another threat regarding patch quality is that we labeled a generated patch as "LowQ" relative to its competing "HighQ" patch in this study. We plan to work on a more general solution by establishing a quality baseline, for example the buggy location as in the control group, and distinguishing the quality of auto-generated patches according to this baseline.

To avoid repeated testing [20], we did not invite domain experts who were familiar with the buggy code, since it would be hard to judge whether their good performance is attributable to the debugging aids or to their a priori knowledge of the corresponding buggy code. Instead, we recruited participants who had little or no domain knowledge of the code. This is not an unrealistic setting, since in practice developers are often required to work with unfamiliar code due to limited human resources or deadline pressure [9][19][25]. However, our findings may not generalize to project developers who are sufficiently knowledgeable about the code. Future studies are required to explore how developers make use of auto-generated patches to debug their familiar code.

Our manual evaluation of the participants' patches might be biased. We mitigate this threat by considering patches verified by the original project developers as the ground truth. Three human evaluators individually evaluated the patches over two iterations and reached a consensus, with a Fleiss' kappa of 0.956 (Section 2.4).
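A minimal sketch of how such an agreement score can be computed with statsmodels is shown below; the three-evaluator rating matrix is hypothetical.

```python
# Minimal sketch: Fleiss' kappa for agreement among three patch evaluators.
# Each row is one submitted patch; each column is one evaluator's verdict
# (1 = correct, 0 = incorrect). The matrix below is hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
])

# aggregate_raters turns (subjects x raters) labels into per-category counts.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")   # the paper reports 0.956
```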

Our measurement of debugging time might be inaccurate for Engr and MTurk participants, since the online system cannot remotely monitor participants when they are away from the keyboard during debugging. To mitigate this threat, we asked these participants to self-report the time they had spent on each task and found no significant difference between their estimated time and our recorded time.
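One way such a consistency check could be run is sketched below; the paper does not name the specific test, so the Wilcoxon signed-rank test and all time values here are assumptions for illustration only.

```python
# Minimal sketch: compare self-reported vs. recorded debugging times per task.
# The test choice (Wilcoxon signed-rank) and all numbers are assumptions; the
# paper only states that no significant difference was found.
from scipy.stats import wilcoxon

recorded      = [14, 22, 35, 18, 40, 12, 27, 30]   # minutes, hypothetical
self_reported = [15, 20, 35, 20, 38, 10, 25, 30]   # minutes, hypothetical

stat, p_value = wilcoxon(recorded, self_reported)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
```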

One may argue that participants might have blindly reused the auto-generated patches instead of truly fixing the bugs on their own. We took several preventive measures to discourage such behavior. First, we emphasized in the instructions that the provided patches may or may not really fix the bugs, and that participants should make their own judgment. Second, we claimed to hold back additional test cases of our own, so that participants would not relax even if they passed all the given tests by reusing the generated patches. We also required participants to justify their patches during submission. As discussed in Section 3, the participants aided by generated patches spent a similar amount of debugging time to the Location group. This indicates that participants were unlikely to reuse generated patches directly, which would otherwise take only seconds to complete.

6. RELATED WORK

Automatic patch generation techniques have been actively studied. Most studies evaluated the quantitative aspects of the proposed techniques, such as the number of candidates generated before a valid patch is found and the number of subjects for which patches can be generated [8][21][26][30][33][34]. Our study complements prior studies by qualitatively evaluating the usefulness of auto-generated patches as debugging aids.

Several studies have discussed the quality of generated patches. Kim et al. evaluated generated patches using human acceptability [17], which is adopted in our study to indicate patch quality. Fry et al. considered patch quality in terms of understandability and maintainability [13]. Their human study revealed that generated patches alone are less maintainable than human-written patches. However, when augmented with synthesized documentation, generated patches have comparable or even better maintainability than human-written patches [13]. Fry et al. also investigated code features (e.g., the number of conditionals) that correlate with patch maintainability [13].

While patch quality can be measured from various perspectives, research has suggested that, instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help participants perform difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, while generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behaviors. In fact, contrary to the findings of Ceccato et al., we observed that the participants' debugging performance dropped when they were given low-quality generated patches.

7. CONCLUSION

We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discover that the participants do fix bugs more quickly and correctly using high-quality generated patches. However, when aided by low-quality generated patches, the participants' debugging correctness is compromised.

Based on the study results, we have learned important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required, even if they are used as debugging aids instead of being directly integrated into the production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize the overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs. For easy bugs, however, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from using generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable about the buggy code.

8. ACKNOWLEDGMENTS

We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9. REFERENCES

[1] Amazon Mechanical Turk. https://www.mturk.com/mturk/.
[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.
[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE '09.
[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE '09.
[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE '11, 2011.
[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezzè. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE '12, pages 452-462.
[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE '09.
[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: an exploratory study. ICSE '12, pages 266-276.
[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. IEEE Transactions on Software Engineering, 32(11):849-867, 2006.
[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[12] J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage, 1997.
[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA '12.
[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM '10.
[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE '10.
[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.
[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC '03.
[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971-987, 2006.
[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.
[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. ICSE '12.
[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1-23, 2013.
[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241-256, 2006.
[24] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE '14.
[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: a controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 12-22, 2006.
[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks - an empirical study with industry practitioners. In ICSM '11.
[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11.
[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP '09.
[30] Y. Qi, X. Mao, Y. Lei, and C. Wang. Using automated program repair for evaluating the effectiveness of fault localization techniques. ISSTA '13, 2013.
[31] N. Ramasubbu, M. Cataldo, R. K. Balan, and J. D. Herbsleb. Configuring global software teams: A multi-company analysis of project productivity, quality, and profits. ICSE '11, pages 261-270, 2011.
[32] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. CHI EA '10.
[33] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller. Automated fixing of programs with contracts. ISSTA '10.
[34] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE '09.


Feedback (Figure 3c) HighQ patches are consideredmuch more helpful for debugging compared to LowQ patchesMann-Whitney U test suggests that this difference is statis-tically significant (p-value = 00006 lt 005) To understandthe reasons behind this result we further analyze partici-pantsrsquo textual answers and discuss the details in Section 4

32 Participant TypesOur three types of participants Engr Grad and MTurk

have different demographics and are exposed to different ex-perimental settings For this reason we investigate the de-bugging performance by participant types

Figure 5a shows a consistent trend across all three partic-ipant types the HighQ group makes the highest percentageof correct patches followed by the Location group while theLowQ group makes the lowest Note that Grad has a lowercorrectness compared to Engr and MTurk possibly due totheir relatively short programming experience

Figure 5b reveals that Engr debugs faster than Grad andMTurk participants Yet among the participants from thesame population their debugging time does not vary toomuch between the Location LowQ and HighQ groups

0

20

40

60

Location LowQ HighQ(a)

o

f co

rrec

t p

atch

es

0

20

40

60

Location LowQ HighQ(b)

Deb

ug

gin

g t

ime

(min

)

1(Not helpful)

2(Slightly helpful)

3(Medium)

4(Helpful)

5(Very helpful)

LowQ HighQ(c)

Figure 3 (a) the percentage of correct patches by different debugging aids (b) debugging time by different aids (c) partici-pantsrsquo perception about the helpfulness of LowQ and HighQ patches

33 Bug DifficultySince the five bugs vary in symptoms and defect types

they possibly pose different levels of challenges to the partic-ipants Research also shows that task difficulty potentiallycorrelates to usersrsquo perception about automated diagnosticaids [23] For this reason we investigate participantsrsquo de-bugging performance with respect to different bugs Weobtained the difficulty ratings of bugs from the exit sur-vey Note that we adopt the difficulty ratings from only theLocation group since other participantsrsquo perceptions couldhave been influenced by the presence of generated patchesAs shown in Figure 4 Rhino-192226 is considered to be themost difficult bug while Rhino-217379 is considered to be theeasiest In particular ANOVA with Tukey HSD post-hoctest [20] shows that the difficulty rating of Rhino-1922264 issignificantly higher than that of the remaining four bugs

Figure 5c shows that for bug3 the HighQ group made aldquolandslide victoryrdquo with 70 of the submitted patches beingcorrect while no one from the Location or the LowQ groupwas able to fix it correctly One primary reason could bethe complex nature of this bug which is caused by a miss-ing if construct (Table 2) This type of bug also knownas the code omission fault can be hard to identify [10][14]Fry and Weimer also empirically showed that the accuracyof human fault localization is relatively low for missing-conditional faults [14] In addition the fix location of thisbug is not the statement that throws the exception but fivelines above it The code itself is also complex for involvinglexical parsing and dynamic compilation Hence both theLocation and LowQ groups stumbled on this bug Howeverthe HighQ group indeed benefited from the generated patchas one participant explained

ldquoWithout deep knowledge of the data structure represent-ing the code being optimized the problem would have takenquite a long time to determine by manually The patchprecisely points to the problem and provided a good hintand my debugging time was cut down to about 20 min-utesrdquo

For bug4 interestingly the Location group performs bet-ter than the LowQ and even the HighQ group (Figure 5c)One possible explanation is that the code of this bug re-quires less project-specific knowledge (Table 2) and its fix

4For simplicity in the remaining of the paper we refer toeach bug using its order appearing in Table 1

1Very easy

2Easy

3Medium

4Difficult

5Very difficult

Bug1

Mathminus280 Bug2

Rhinominus114493 Bug3

Rhinominus192226 Bug4

Rhinominus217379 Bug5

Rhinominus76683

Figure 4 The average difficulty rating for each bug

by adding a null checker is a common pattern [17] There-fore even without the aid of generated patches participantswere likely to resolve this bug solely by themselves and theyindeed considered it to be the easiest bug (Figure 4) Forbug5 the LowQ patch deletes the entire buggy block insidethe for-loop (Table 2) which might raise suspicion for beingtoo straightforward This possibly explains why the LowQand Location groups have the same correctness for this bug

In general the HighQ group has made the highest percent-age of correct patches for all bugs except bug4 The LowQgroup on the other hand has comparable or even lower cor-rectness than the Location group As for debugging timeparticipants in general are slower for the first three bugs asshown in Figure 5d We speculate that this also relates tothe difficulty of the bugs as these three bugs are consideredto be more difficult than the other two (Figure 4)

34 Programming ExperienceResearch has found that expert and novice developers

are different in terms of debugging strategies and perfor-mance [18] In this study we also explore whether the use-fulness of generated patches as debugging aids is affected byparticipantsrsquo programming experience

Among all 95 participants 72 reported their Java pro-gramming experience They have up to 14 and on average44 years of Java programming experience Accordingly wedivided these 72 participants into two groups mdash experts withabove average and novices with below average programmingexperience

As shown in Figure 5e experts are 60 correct when aidedby Location This number increases to 76 when they areaided by HighQ patches and decreases to 45 when they

0

20

40

60

80

Engr Grad MTurk(a)

o

f co

rrec

t pat

ches

0

20

40

60

80

Bug1 Bug2 Bug3 Bug4 Bug5(c)

0

20

40

60

Experts Novices(e)

Debugging aidLocationLowQHighQ

0

20

40

60

Engr Grad MTurk(b)

Deb

uggin

g t

ime

(min

)

0

20

40

60

Bug1 Bug2 Bug3 Bug4 Bug5(d)

0

20

40

60

Experts Novices(f)

Debugging aidLocationLowQHighQ

Figure 5 Percentage of correct patches and debugging time by participants bugs and programming experience

are aided by LowQ patches Novices manifest a similaryet sharper trend their percentage of correct patches in-creases substantially from 44 to 74 when they are aidedby Location and HighQ patches respectively Interestinglywhen aided by HighQ patches the debugging correctnessof novices is almost as good as that of the experts Fig-ure 5f shows that novices also spend less time debuggingwith HighQ patches than with Location and LowQ patches

35 Regression AnalysisPrior subsections observe that changes in debugging cor-

rectness and time are attributable to debugging aids bugsparticipant types programming experience and their jointeffects To quantify the effects of these factors on the de-bugging performance we perform multiple regression anal-ysis Multiple regression analysis is a statistical process forquantifying relations between a dependent variable and anumber of independent variables [12] which fits naturally toour problem Multiple regression analysis also quantifies thestatistical significance of the estimated relations [12] Thistype of analysis has been applied in other similar researchfor example to analyze how configurational characteristicsof software teams affect project productivity [4][5][31]

Debugging aids participant types and bugs are all cate-gorical variables with 3 3 5 different values respectivelySince categorical variables cannot be directly used in regres-sion models we apply dummy coding to transform a cate-gorical variable with k values into kminus1 binary variables [12]For example for the debugging aids we create two binaryvariables HighQ and LowQ one of them being 1 if the cor-responding patch is given When both variables are 0 theLocation aid is given and serves as the basis to which theother two categories are compared In the same manner wedummy-code participant types and bugs with ldquoGradrdquo andldquobug4rdquo as the basis categories5 Below shows the equationsthat form our regression models Since patch correctness

5The selection of basis category is essentially arbitrary anddoes not affect the regression results [12]

is binary data while debugging time is continuous data weestimate equation 1 using logistic regression and equation 2using linear regression [12] Note that we log-transform de-bugging time to ensure its normality [12]

logit(correctness) = α0 + α1 lowastHighQ + α2 lowast LowQ

+α3 lowast Engr + α4 lowastMTurk + α5 lowast Bug1

+α6 lowast Bug2 + α7 lowast Bug3

+α8 lowast Bug5 + α9 lowast JavaYears

(1)

ln(time) = β0 + β1 lowastHighQ + β2 lowast LowQ + β3 lowast Engr

+β4 lowastMTurk + β5 lowast Bug1 + β6 lowast Bug2

+β7 lowast Bug3 + β8 lowast Bug5 + β9 lowast JavaYears

(2)

Table 4 shows the regression results with the first num-ber as the variable coefficient and the second number in theparenthesis as its p-value In general a positive coefficientmeans the variable (row name) improves the debugging per-formance (column name) compared to the basis categorywhile a negative coefficient means the opposite Since thelogistic regression computes the log of odds for correctnessand we log-transform debugging time in the linear regres-sion both computed coefficients need to be ldquoanti-loggedrdquo tointerpret the results We highlight our findings below

HighQ patches significantly improve debugging co-rrectness with coef=125gt0 and p-value=000lt005 Toput this into perspective holding other variables fixed theodds of the HighQ group making correct patches is e125=35times that of the Location group The LowQ group on theother hand is less likely to make correct patches thoughthis trend is not significant (coef= -055 p=009) The aidof HighQ (coef=-019 p=008) and LowQ patches (coef=-007p=056) both slightly reduce the debugging time butnot significantly

Difficult bugs significantly slow down debuggingas bug123 all have significant positive coefficients or longerdebugging time For the most difficult bug3 its debugging

Table 4 Regression results Arrows indicate the direction ofimpact on debugging performance Double arrows indicatethat the coefficient is statistically significant at 5 level

Correctness Debugging Timecoef (p-value) coef (p-value)

HighQ 125 (000) uArr -019 (008) darrLowQ -055 (009) darr -007 (056) darrBug1 -057 (018) darr 096 (000) uArrBug2 -050 (022) darr 056 (000) uArrBug3 -209 (000) dArr 059 (000) uArrBug5 -059 (016) darr -005 (075) darrEngr 103 (002) uArr -125 (000) dArr

MTurk 090 (002) uArr -016 (020) darrJava years 012 (002) uArr -003 (011) darr

time is e059=18 times that of bug4 yet the odds of correctlyfixing it is only eminus209asymp18 of the odds of correctly fixingbug4

Participant type and experience significantly af-fect debugging performance compared to Grad Engrand MTurk are much more likely to make correct patches(coef=103 and 090 with p=002) Engr is also significantlyfaster spending approximately eminus125asymp13 time debuggingAlso more Java programming experience significantly im-proves debugging correctness (coef=012 p=002)

Our regression analysis thus far investigates the individualimpact of debugging aids and other variables on the debug-ging performance However the impact of debugging aidsalso possibly depends on the participants who use the aidsor the bugs to be fixed To explore such joint effects weadd interaction variables to the regression model and useANOVA to analyze their explanatory powers [12] Beloware our findings

HighQ patches are more useful for difficult bugsas shown in Table 5 in addition to the four single variablesthe debugging aids and bugs also jointly explain large vari-ances in the patch correctness (deviance reduction=2554p=000) Hence we add this interaction variable AidBugto the logistic regression for estimating correctness Table 6shows that HighQBug3 has a relatively large positive co-efficient (coef=1869) indicating that the HighQ patch isparticularly useful for fixing this bug

The type of aid does not affect debugging timenone of the interaction variables significantly contributes tothe model (Table 5) Interestingly neither does the debug-ging aid itself which explains only 1 (R2=001) of thevariance in the debugging time Instead debugging time ismuch more susceptible to participant types and bugs whichfurther explain 20 and 16 of its variance respectively

These regression results are consistent with those observedfrom Figures 3 4 and 5 in the preceding subsections Due tospace limitation we elaborate more details of the regressionanalysis at httpwwwcseusthk˜idagooautofixregression

analysishtml which reports the descriptive statistics and cor-relation matrix for the independent variables the hierarchi-cal regression results and how Table 4 5 and 6 are derived

36 SummaryBased on our results we draw the following conclusions

bull High-quality generated patches significantly improvedebugging correctness and are particularly beneficial

Table 5 ANOVA for the regression models with interactionvariables It shows how much variance in correctnessdebug-ging time is explained as each (interaction) variable is addedto the model The p-value in parenthesis indicates whetheradding the variable significantly improves the model

Correctness Debugging TimeVariable (resid deviance R2 (variance

reductions) explained)

DebuggingAid 3280 (000) 001 (006)Participant 1685 (000) 020 (000)

Bug 2191 (000) 016 (000)JavaYears 613 (001) 001 (011)

AidParticipant 295 (057) 002 (006)AidBug 2554 (000) 002 (026)

AidJavaYears 118 (055) 000 (082)

Table 6 Coefficients of the interaction variable De-bugAidBug for estimating patch correctness The boldvalue is statistically significant at 5 level

Bug1 Bug2 Bug3 Bug4 Bug5HighQ 042 186 1869 -045 099LowQ -039 -024 012 -176 -009

for difficult bugs Low-quality generated patches sligh-tly undermine debugging correctness

bull Participantsrsquo debugging time is not affected by the de-bugging aids they use However their debugging takesa significantly longer time for difficult bugs

bull Participant type and experience also significantly af-fect debugging performance Engr MTurk or the par-ticipants with more Java programming experience aremore likely to make correct patches

4 QUALITATIVE ANALYSISWe used the survey feedback to qualitatively investigate

participantsrsquo opinions on using auto-generated patches in de-bugging In total we received 71 textual answers After anopen coding phase [20] we identified twelve common reasonswhy participants are positive or negative about using auto-generated patches as debugging aids Table 7 summarizesthese reasons along with the participantsrsquo original answers

Participants from both the HighQ and LowQ groups ac-knowledge generated patches to provide quick starting pointsby pinpointing the general buggy area (P1-P3) HighQ patc-hes in particular simplify debugging and speed up the debug-ging process (P4-P6) However a quick starting point doesnot guarantee that the debugging is going down the rightpath since generated patches can be confusing and mislead-ing (N1-N2) Even HighQ patches may not completely solvethe problem and thus need further human perfection (N3)

Participants regardless of using HighQ or LowQ patcheshave a general concern that generated patches may over-complicate debugging or over-simplify it by not address-ing root causes (N4-N5) In other words participants areconcerned about whether machines are making random oreducated guesses when generating patches (N6) Anothershared concern is that the usage of generated patches shouldbe based on a good understanding to the target programs(N7-N8)

Table 7 Participantsrsquo positive and negative opinions on using auto-generated patches as debugging aids All of these sentencesare the participantsrsquo original feedback We organize them into different categories with the summary written in italic

Positive Negative

It provides a quick starting point

P1 It usually provides a good starting point (HighQ)P2 It did point to the general area of code (LowQ)P3 I have followed the patch to identify the exact placewhere the issue occurred (LowQ)

It can be confusing misleading or incompleteN1 The generated patch was confusing (LowQ)N2 For developers with less experience it can give them thewrong lead (LowQ)N3 Itrsquos not a complete solution The patch tested for two nullreferences however one of the two variables was dereferencedabout 1-2 lines above the faulting line Therefore there shouldhave been two null reference tests (HighQ)

It simplifies the problem

P4 I feel that the hints were quite good and sim-plified the problem (HighQ)P5 The hint provided the solution to the problem thatwould have taken quite a long time to determine bymanually (HighQ)P6 This saves time when it comes to bugs and memorymonitoring (HighQ)

It may over-complicate the problem or over-simplify itby not addressing the root causeN4 Sometimes they might over-complicate the problem or notaddress the root of the issue (HighQ)N5 The patch did not appear to catch the issue and insteadlooked like it deleted the block of code (LowQ)N6Generated patches are only useful when the machine doesnot do it randomly eg it cannot just guess that somethingmay fix it and if it does that that is good (LowQ)

It helps brainstorming

P7 They would seem to be useful in helping findvarious ideas around fixing the issue even if the patchisnrsquot always correct on its own (LowQ)

It may not be helpful for unfamiliar code

N7 I prefer to fix the bug manually with a deeper un-derstanding of the program (HighQ)N8 A context to the problem may have helped here (LowQ)

It recognizes easy problems

P8 They would be good at recognizing obviousproblems (eg potential NPE) (HighQ)

It may not recognize complicated problems

N9 They would be difficult to recognize more involveddefects (HighQ)

It makes tests pass

P9 The patch made the test case pass (LowQ)

It may not work if the test suite itself is insufficientN10 Assuming the test cases are sufficient the patch willwork (HighQ)N11 Irsquom worried that other test cases will fail (LowQ)

It provides extra diagnostic information

P10 Any additional information regarding a bugis almost always helpful (HighQ)

It cannot replace standard debugging tools

N12 I would use them as debugging aids along withaccess to standard debugging tools (LowQ)

Interestingly we also observe several opposite opinionsParticipants think that generated patches may be good atrecognizing easy bugs (P8) but may not work well for com-plicated ones (N9) While generated patches can quicklypass test cases (P9) they work only if test cases are sufficient(N10-N11) Finally participants use generated patches tobrainstorm (P7) and acquire additional diagnostic informa-tion (P10) However they still consider generated patchesdispensable especially when powerful debugging tools areavailable (N12)

In general participants are positive about auto-generatedpatches as they provide quick hints about either buggy ar-eas or even fix solutions On the other hand participantsare less confident about the capability of generated patchesfor handling complicated bugs and addressing root causesespecially when they are unfamiliar with code and test cases

5 THREATS TO VALIDITYThe five bugs and ten generated patches we used to con-

struct debugging tasks may not be representative of all bugsand generated patches We mitigate this threat by selectingreal bugs with varied symptoms and defect types and using

patches that were generated by two state-of-the-art programrepair techniques [17][34] Nonetheless further studies withwider coverage of bugs and auto-generated patches are re-quired to strengthen the generalizability of our findings

We measured patch quality using human-perceived ac-ceptability which may not generalize to other measurementssuch as metric-based ones [27] However patch quality isa rather broad concept and research has suggested thatthere is no easy way to definitively describe patch qual-ity [13][24] Another threat regarding patch quality is thatwe labeled a generated patch as ldquoLowQrdquo relative to its com-peting ldquoHighQrdquo patch in this study We plan to work on amore general solution by establishing a quality baseline forexample the buggy location as in the control group and dis-tinguishing the quality of auto-generated patches accordingto this baseline

To avoid repeated testing [20] we did not invite domainexperts who were familiar with the buggy code since itwould be hard to judge whether their good performance isattributable to debugging aids or their a priori knowledge ofthe corresponding buggy code Instead we recruited partic-ipants who had little or no domain knowledge of the codeThis is not an unrealistic setting since in practice devel-

opers are often required to work with unfamiliar code dueto limited human resources or deadline pressure [9][19][25]However our findings may not generalize to project devel-opers who are sufficiently knowledgeable about the codeFuture studies are required to explore how developers makeuse of auto-generated patches to debug their familiar code

Our manual evaluation of the participantsrsquo patches mightbe biased We mitigate this threat by considering patchesverified by the original project developers as the groundtruth Three human evaluators individually evaluated thepatches for two iterations and reached a consensus with theFleiss kappa equals to 0956 (Section 24)

Our measuring of debugging time might be inaccuratefor Engr and MTurk participants since the online systemcannot remotely monitor participants when they were awayfrom keyboard during debugging To mitigate this threatwe asked these participants to self-report the time they hadspent on each task and found no significant difference be-tween their estimated time and our recorded time

One may argue that participants might have blindly reusedthe auto-generated patches instead of truly fixing bugs ontheir own We took several preventive measures to discour-age such behaviors First we emphasized in the instructionsthat the provided patches may or may not really fix the bugsand participants should make their own judgement Secondwe claimed to preserve additional test cases by ourselves sothat participants may not relax even if they passed all thetests by reusing the generated patches We also requiredparticipants to justify their patches during submissions Asdiscussed in Section 3 the participants aided by generatedpatches spent a similar amount of debugging time to theLocation group This indicates that participants were lesslikely to reuse generated patches directly which would oth-erwise take only seconds to complete

6 RELATED WORKAutomatic patch generation techniques have been actively

studied Most studies evaluated the quantitative aspects ofproposed techniques such as the number of candidates gen-erated before a valid patch is found and the number of sub-jects for which patches can be generated [8][21][26][30][33][34] Our study complements prior studies by qualitativelyevaluating the usefulness of auto-generated patches as de-bugging aids

Several studies have discussed the quality of generated patches. Kim et al. evaluated generated patches using human acceptability [17], which is adopted in our study to indicate patch quality. Fry et al. considered patch quality as its understandability and maintainability [13]. Their human study revealed that generated patches alone are less maintainable than human-written patches. However, when augmented with synthesized documentation, generated patches have comparable or even better maintainability than human-written patches [13]. Fry et al. also investigated code features (e.g., number of conditionals) that correlate to patch maintainability [13].

While patch quality can be measured from various perspectives, research has suggested that instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help perform difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, while generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behaviors. In fact, contrary to the findings of Ceccato et al., we observed that the participants' debugging performance dropped when given low-quality generated patches.

7. CONCLUSION
We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discover that the participants do fix bugs more quickly and correctly using high-quality generated patches. However, when aided by low-quality generated patches, the participants' debugging correctness is compromised.

Based on the study results, we have learned important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required even if they are used as debugging aids rather than being directly integrated into production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize the overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs; for easy bugs, however, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from using generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable of the buggy code.

8. ACKNOWLEDGMENTS
We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9. REFERENCES
[1] Amazon Mechanical Turk. https://www.mturk.com/mturk.
[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.
[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE '09.
[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE '09.
[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE '11, 2011.
[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezze. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE '12, pages 452-462.
[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE '09.
[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: an exploratory study. ICSE '12, pages 266-276.
[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. Software Engineering, IEEE Transactions on, 32(11):849-867, 2006.
[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[12] J. Fox. Applied regression analysis, linear models, and related methods. Sage, 1997.
[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA '12.
[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM '10.
[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE '10.
[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.
[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC '03.
[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971-987, 2006.
[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.
[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. ICSE '12.
[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1-23, 2013.
[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241-256, 2006.
[24] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE '14.
[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: a controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 12-22, 2006.
[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.
[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks - an empirical study with industry practitioners. In ICSM '11.
[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11.
[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP '09.
[30] Y. Qi, X. Mao, Y. Lei, and C. Wang. Using automated program repair for evaluating the effectiveness of fault localization techniques. ISSTA '13, 2013.
[31] N. Ramasubbu, M. Cataldo, R. K. Balan, and J. D. Herbsleb. Configuring global software teams: A multi-company analysis of project productivity, quality, and profits. ICSE '11, pages 261-270, 2011.
[32] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. CHI EA '10.
[33] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller. Automated fixing of programs with contracts. ISSTA '10.
[34] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE '09.


were recommended but not forced to use generated patchesfor debugging Our results consistently show that patch un-derstandability affects the recommendation outcome

Automated support for debugging is another line of re-lated work Parnin and Orso conducted a controlled ex-periment requiring human subjects to debug with or with-out the support of an automatic fault localization tool [28]Contrary to their finding that the tool did not help performdifficult tasks we found that generated patches tend to bemore beneficial for fixing difficult bugs

Ceccato et al conducted controlled experiments in whichhumans performed debugging tasks using manually writtenor automatically generated test cases [7] They discoveredthat despite the lack of readability human debugging per-formance improved significantly with auto-generated testcases However generated test cases are essentially dif-ferent from the generated patches used in our study Forinstance generated test cases include sequences of methodcalls that may expose buggy program behaviors while gen-erated patches directly suggest bug fixes Such differencesmight cause profound deviations in human debugging be-haviors In fact contrary to the findings of Ceccato et alwe observed that the participantsrsquo debugging performancedropped when given low-quality generated patches

7 CONCLUSIONWe conducted a large-scale human study to investigate the

usefulness of automatically generated patches as debuggingaids We discover that the participants do fix bugs morequickly and correctly using high-quality generated patchesHowever when aided by low-quality generated patches theparticipantsrsquo debugging correctness is compromised

Based on the study results we have learned importantaspects to consider when using auto-generated patches indebugging First of all strict quality control of generatedpatches is required even if they are used as debugging aidsinstead of direct integration into the production code Oth-erwise low-quality generated patches may negatively affectdevelopersrsquo debugging capabilities which might in turn jeop-ardize the overall software quality Second to maximize theusefulness of generated patches proper tasks should be se-lected For instance generated patches can come in handywhen fixing difficult bugs For easy bugs however simpleinformation such as buggy locations might suffice Alsousersrsquo development experience matters as our result showsthat novices with less programming experience tend to ben-efit more from using generated patches

To corroborate these findings we need to extend our studyto cover more diverse bug types and generated patches Wealso need to investigate the usefulness of generated patchesin debugging with a wider population such as domain ex-perts who are sufficiently knowledgeable of the buggy code

8 ACKNOWLEDGMENTSWe thank all the participants for their help with the hu-

man study We thank Fuxiang Chen and Gyehyung Jeon fortheir help with the result analysis This work was funded byResearch Grants Council (General Research Fund 613911) ofHong Kong National High-tech RampD 863 Program (2012AA-011205) and National Natural Science Foundation (6110003891318301 61321491 61361120097) of China

9 REFERENCES[1] Amazon mechanical turk

httpswwwmturkcommturk

[2] J Anvik L Hiew and G C Murphy Coping with anopen bug repository In Proceedings of the 2005OOPSLA workshop on Eclipse technology eXchange

[3] J Aranda and G Venolia The secret life of bugsGoing past the errors and omissions in softwarerepositories ICSErsquo09

[4] C Bird N Nagappan P Devanbu H Gall andB Murphy Does distributed development affectsoftware quality an empirical case study of windowsvista ICSE rsquo09

[5] C Bird N Nagappan B Murphy H Gall andP Devanbu Donrsquot touch my code Examining theeffects of ownership on software quality ESECFSErsquo11 2011

[6] A Carzaniga A Gorla A Mattavelli N Perino andM Pezze Automatic recovery from runtime failuresIn Proceedings of the 2013 International Conferenceon Software Engineering ICSE rsquo13

[7] M Ceccato A Marchetto L Mariani C Nguyenand P Tonella An empirical study about theeffectiveness of debugging when random test cases areused In ICSErsquo12 pages 452ndash462

[8] V Dallmeier A Zeller and B Meyer Generatingfixes from object behavior anomalies In Proceedings ofthe 2009 IEEEACM International Conference onAutomated Software Engineering ASErsquo09

[9] E Duala-Ekoko and M P Robillard Asking andanswering questions about unfamiliar APIs anexploratory study ICSErsquo12 pages 266ndash276

[10] J Duraes and H Madeira Emulation of softwarefaults A field data study and a practical approachSoftware Engineering IEEE Transactions on32(11)849ndash867 2006

[11] J L Fleiss Measuring nominal scale agreement amongmany raters Psychological bulletin 76(5)378 1971

[12] J Fox Applied regression analysis linear models andrelated methods Sage 1997

[13] Z P Fry B Landau and W Weimer A human studyof patch maintainability In Proceedings of the 2012International Symposium on Software Testing andAnalysis ISSTArsquo12

[14] Z P Fry and W Weimer A human study of faultlocalization accuracy ICSMrsquo10

[15] Z Gu E T Barr D J Hamilton and Z Su Has thebug really been fixed ICSErsquo10

[16] JNielsen Time budgets for usability sessionshttpwwwuseitcomalertboxusability sessionshtml

[17] D Kim J Nam J Song and S Kim Automaticpatch generation learned from human-written patchesIn Proceedings of the 2013 International Conferenceon Software Engineering ICSErsquo13

[18] A Ko and B Uttl Individual differences in programcomprehension strategies in unfamiliar programmingsystems In IWPCrsquo03

[19] A J Ko B A Myers M J Coblenz and H HAung An Exploratory Study of How Developers SeekRelate and Collect Relevant Information duringSoftware Maintenance Tasks IEEE Transactions on

Software Engineering 32971ndash987 2006

[20] J Lazar J H Feng and H Hochheiser ResearchMethods in Human-Computer Interaction John WileySons 2010

[21] C Le Goues M Dewey-Vogt S Forrest andW Weimer A systematic study of automated programrepair fixing 55 out of 105 bugs for $8 each ICSErsquo12

[22] C Le Goues S Forrest and W Weimer Currentchallenges in automatic software repair SoftwareQuality Journal pages 1ndash23 2013

[23] P Madhavan D A Wiegmann and F C LacsonAutomation failures on tasks easily performed byoperators undermine trust in automated aids HumanFactors The Journal of the Human Factors andErgonomics Society 48(2)241ndash256 2006

[24] M Monperrus A critical review of Sautomatic patchgeneration learned from human-written patchesT Anessay on the problem statement and the evaluation ofautomatic software repair ICSErsquo14

[25] T H Ng S C Cheung W K Chan and Y T YuWork experience versus refactoring to design patternsa controlled experiment In Proceedings of the 14thACM SIGSOFT international symposium onFoundations of software engineering SIGSOFTrsquo06FSE-14 pages 12ndash22 2006

[26] H D T Nguyen D Qi A Roychoudhury andS Chandra SemFix program repair via semanticanalysis In Proceedings of the 2013 InternationalConference on Software Engineering ICSErsquo13

[27] K Nishizono S Morisaki R Vivanco andK Matsumoto Source code comprehension strategiesand metrics to predict comprehension effort insoftware maintenance and evolution tasks - anempirical study with industry practitioners InICSMrsquo11

[28] C Parnin and A Orso Are automated debuggingtechniques actually helping programmers InProceedings of the 2011 International Symposium onSoftware Testing and Analysis ISSTArsquo11

[29] J H Perkins S Kim S Larsen S AmarasingheJ Bachrach M Carbin C Pacheco F SherwoodS Sidiroglou G Sullivan W-F Wong Y ZibinM D Ernst and M Rinard Automatically patchingerrors in deployed software SOSPrsquo09

[30] Y Qi X Mao Y Lei and C Wang Using automatedprogram repair for evaluating the effectiveness of faultlocalization techniques ISSTA rsquo13 2013

[31] N Ramasubbu M Cataldo R K Balan and J DHerbsleb Configuring global software teams Amulti-company analysis of project productivityquality and profits ICSE rsquo11 pages 261ndash270 2011

[32] J Ross L Irani M S Silberman A Zaldivar andB Tomlinson Who are the crowdworkers shiftingdemographics in mechanical turk CHI EA rsquo10

[33] Y Wei Y Pei C A Furia L S Silva S BuchholzB Meyer and A Zeller Automated fixing ofprograms with contracts ISSTA rsquo10

[34] W Weimer T Nguyen C Le Goues and S ForrestAutomatically finding patches using geneticprogramming ICSE rsquo09

  • Introduction
  • Experimental Process
    • Bug Selection and Debugging Aids
    • Participants
    • Study Design
    • Evaluation Measures
      • Results
        • Debugging Aids
        • Participant Types
        • Bug Difficulty
        • Programming Experience
        • Regression Analysis
        • Summary
          • Qualitative Analysis
          • Threats to Validity
          • Related Work
          • Conclusion
          • Acknowledgments
          • References
Page 8: Automatically Generated Patches as Debugging Aids: A Human ...people.csail.mit.edu/hunkim/papers/tao-fse2014.pdf · Debugging is of paramount importance in the process of software

Table 4: Regression results. Arrows indicate the direction of impact on debugging performance. Double arrows indicate that the coefficient is statistically significant at the 5% level.

               Correctness         Debugging Time
               coef (p-value)      coef (p-value)
  HighQ         1.25 (0.00) ⇑      -0.19 (0.08) ↓
  LowQ         -0.55 (0.09) ↓      -0.07 (0.56) ↓
  Bug1         -0.57 (0.18) ↓       0.96 (0.00) ⇑
  Bug2         -0.50 (0.22) ↓       0.56 (0.00) ⇑
  Bug3         -2.09 (0.00) ⇓       0.59 (0.00) ⇑
  Bug5         -0.59 (0.16) ↓      -0.05 (0.75) ↓
  Engr          1.03 (0.02) ⇑      -1.25 (0.00) ⇓
  MTurk         0.90 (0.02) ⇑      -0.16 (0.20) ↓
  Java years    0.12 (0.02) ⇑      -0.03 (0.11) ↓

time is e^0.59 ≈ 1.8 times that of bug4, yet the odds of correctly fixing it are only e^-2.09 ≈ 1/8 of the odds of correctly fixing bug4.
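As a worked reading of these coefficients (treating the correctness model as logistic regression on the log-odds and the debugging-time model as regression on log time, which is consistent with the exponentiation above), bug3's row of Table 4 gives:

\[
\frac{\mathrm{time}_{\mathrm{bug3}}}{\mathrm{time}_{\mathrm{bug4}}} \approx e^{0.59} \approx 1.8,
\qquad
\frac{\mathrm{odds}_{\mathrm{bug3}}}{\mathrm{odds}_{\mathrm{bug4}}} = e^{-2.09} \approx 0.12 \approx \frac{1}{8}.
\]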

Participant type and experience significantly affect debugging performance: compared to Grad, Engr and MTurk are much more likely to make correct patches (coef=1.03 and 0.90, with p=0.02). Engr is also significantly faster, spending approximately e^-1.25 ≈ 1/3 of the time debugging. Also, more Java programming experience significantly improves debugging correctness (coef=0.12, p=0.02).

Our regression analysis thus far investigates the individual impact of debugging aids and other variables on the debugging performance. However, the impact of debugging aids also possibly depends on the participants who use the aids or the bugs to be fixed. To explore such joint effects, we add interaction variables to the regression model and use ANOVA to analyze their explanatory powers [12]. Below are our findings.

HighQ patches are more useful for difficult bugs: as shown in Table 5, in addition to the four single variables, the debugging aids and bugs also jointly explain large variances in the patch correctness (deviance reduction = 25.54, p = 0.00). Hence, we add this interaction variable Aid×Bug to the logistic regression for estimating correctness. Table 6 shows that HighQ×Bug3 has a relatively large positive coefficient (coef=18.69), indicating that the HighQ patch is particularly useful for fixing this bug.

The type of aid does not affect debugging time: none of the interaction variables significantly contributes to the model (Table 5). Interestingly, neither does the debugging aid itself, which explains only 1% (R^2 = 0.01) of the variance in the debugging time. Instead, debugging time is much more susceptible to participant types and bugs, which further explain 20% and 16% of its variance, respectively.

These regression results are consistent with those observed from Figures 3, 4, and 5 in the preceding subsections. Due to space limitation, we elaborate more details of the regression analysis at http://www.cse.ust.hk/~idagoo/autofix/regressionanalysis.html, which reports the descriptive statistics and correlation matrix for the independent variables, the hierarchical regression results, and how Tables 4, 5, and 6 are derived.
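For readers who want to reproduce this kind of analysis, the following is a minimal sketch of how a correctness model with an Aid×Bug interaction could be fit with Python's statsmodels. The column names and the synthetic data are hypothetical placeholders rather than our study data, and the sketch omits the hierarchical steps and the linear model for debugging time reported online:

    # Minimal sketch of a correctness model with an Aid x Bug interaction.
    # Column names and data below are hypothetical placeholders, not our dataset.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 300
    df = pd.DataFrame({
        "correct": rng.integers(0, 2, n),                      # 1 = correct patch
        "aid": rng.choice(["Location", "HighQ", "LowQ"], n),   # debugging aid
        "bug": rng.choice(["bug1", "bug2", "bug3", "bug4", "bug5"], n),
        "participant": rng.choice(["Grad", "Engr", "MTurk"], n),
        "java_years": rng.integers(0, 15, n),
    })

    # Logistic regression of correctness; Location, bug4, and Grad act as the
    # baselines, and "aid * bug" adds main effects plus the interaction terms.
    model = smf.logit(
        "correct ~ C(aid, Treatment('Location')) * C(bug, Treatment('bug4'))"
        " + C(participant, Treatment('Grad')) + java_years",
        data=df,
    ).fit()
    print(model.summary())

    # Deviance reductions (as in Table 5) come from comparing nested models:
    # deviance = -2 * model.llf; refit without a term and take the difference.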

3.6 Summary
Based on our results, we draw the following conclusions:

• High-quality generated patches significantly improve debugging correctness and are particularly beneficial for difficult bugs. Low-quality generated patches slightly undermine debugging correctness.

• Participants' debugging time is not affected by the debugging aids they use. However, their debugging takes a significantly longer time for difficult bugs.

• Participant type and experience also significantly affect debugging performance. Engr, MTurk, or the participants with more Java programming experience are more likely to make correct patches.

Table 5: ANOVA for the regression models with interaction variables. It shows how much variance in correctness/debugging time is explained as each (interaction) variable is added to the model. The p-value in parentheses indicates whether adding the variable significantly improves the model.

  Variable           Correctness          Debugging Time
                     (resid. deviance     R^2 (variance
                     reduction)           explained)
  DebuggingAid       32.80 (0.00)         0.01 (0.06)
  Participant        16.85 (0.00)         0.20 (0.00)
  Bug                21.91 (0.00)         0.16 (0.00)
  JavaYears           6.13 (0.01)         0.01 (0.11)
  Aid×Participant     2.95 (0.57)         0.02 (0.06)
  Aid×Bug            25.54 (0.00)         0.02 (0.26)
  Aid×JavaYears       1.18 (0.55)         0.00 (0.82)

Table 6: Coefficients of the interaction variable DebugAid×Bug for estimating patch correctness. The bold value (HighQ×Bug3) is statistically significant at the 5% level.

            Bug1     Bug2     Bug3     Bug4     Bug5
  HighQ     0.42     1.86    18.69    -0.45     0.99
  LowQ     -0.39    -0.24     0.12    -1.76    -0.09

4 QUALITATIVE ANALYSIS
We used the survey feedback to qualitatively investigate participants' opinions on using auto-generated patches in debugging. In total, we received 71 textual answers. After an open coding phase [20], we identified twelve common reasons why participants are positive or negative about using auto-generated patches as debugging aids. Table 7 summarizes these reasons along with the participants' original answers.

Participants from both the HighQ and LowQ groups acknowledge that generated patches provide quick starting points by pinpointing the general buggy area (P1-P3). HighQ patches in particular simplify debugging and speed up the debugging process (P4-P6). However, a quick starting point does not guarantee that the debugging is going down the right path, since generated patches can be confusing and misleading (N1-N2). Even HighQ patches may not completely solve the problem and thus need further human refinement (N3).

Participants, regardless of using HighQ or LowQ patches, have a general concern that generated patches may over-complicate debugging or over-simplify it by not addressing root causes (N4-N5). In other words, participants are concerned about whether machines are making random or educated guesses when generating patches (N6). Another shared concern is that the use of generated patches should be based on a good understanding of the target programs (N7-N8).

Table 7: Participants' positive and negative opinions on using auto-generated patches as debugging aids. All of these sentences are the participants' original feedback. We organize them into different categories, each introduced by a one-line summary.

Positive: It provides a quick starting point.
  P1: It usually provides a good starting point. (HighQ)
  P2: It did point to the general area of code. (LowQ)
  P3: I have followed the patch to identify the exact place where the issue occurred. (LowQ)
Negative: It can be confusing, misleading, or incomplete.
  N1: The generated patch was confusing. (LowQ)
  N2: For developers with less experience it can give them the wrong lead. (LowQ)
  N3: It's not a complete solution. The patch tested for two null references; however, one of the two variables was dereferenced about 1-2 lines above the faulting line. Therefore there should have been two null reference tests. (HighQ)

Positive: It simplifies the problem.
  P4: I feel that the hints were quite good and simplified the problem. (HighQ)
  P5: The hint provided the solution to the problem that would have taken quite a long time to determine by manually. (HighQ)
  P6: This saves time when it comes to bugs and memory monitoring. (HighQ)
Negative: It may over-complicate the problem or over-simplify it by not addressing the root cause.
  N4: Sometimes they might over-complicate the problem or not address the root of the issue. (HighQ)
  N5: The patch did not appear to catch the issue and instead looked like it deleted the block of code. (LowQ)
  N6: Generated patches are only useful when the machine does not do it randomly, e.g., it cannot just guess that something may fix it, and if it does that, that is good. (LowQ)

Positive: It helps brainstorming.
  P7: They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn't always correct on its own. (LowQ)
Negative: It may not be helpful for unfamiliar code.
  N7: I prefer to fix the bug manually with a deeper understanding of the program. (HighQ)
  N8: A context to the problem may have helped here. (LowQ)

Positive: It recognizes easy problems.
  P8: They would be good at recognizing obvious problems (e.g., potential NPE). (HighQ)
Negative: It may not recognize complicated problems.
  N9: They would be difficult to recognize more involved defects. (HighQ)

Positive: It makes tests pass.
  P9: The patch made the test case pass. (LowQ)
Negative: It may not work if the test suite itself is insufficient.
  N10: Assuming the test cases are sufficient, the patch will work. (HighQ)
  N11: I'm worried that other test cases will fail. (LowQ)

Positive: It provides extra diagnostic information.
  P10: Any additional information regarding a bug is almost always helpful. (HighQ)
Negative: It cannot replace standard debugging tools.
  N12: I would use them as debugging aids along with access to standard debugging tools. (LowQ)

Interestingly, we also observe several opposite opinions. Participants think that generated patches may be good at recognizing easy bugs (P8) but may not work well for complicated ones (N9). While generated patches can quickly pass test cases (P9), they work only if test cases are sufficient (N10-N11). Finally, participants use generated patches to brainstorm (P7) and acquire additional diagnostic information (P10). However, they still consider generated patches dispensable, especially when powerful debugging tools are available (N12).

In general, participants are positive about auto-generated patches, as they provide quick hints about either buggy areas or even fix solutions. On the other hand, participants are less confident about the capability of generated patches for handling complicated bugs and addressing root causes, especially when they are unfamiliar with code and test cases.

5 THREATS TO VALIDITY
The five bugs and ten generated patches we used to construct debugging tasks may not be representative of all bugs and generated patches. We mitigate this threat by selecting real bugs with varied symptoms and defect types, and by using patches that were generated by two state-of-the-art program repair techniques [17][34]. Nonetheless, further studies with wider coverage of bugs and auto-generated patches are required to strengthen the generalizability of our findings.

We measured patch quality using human-perceived acceptability, which may not generalize to other measurements such as metric-based ones [27]. However, patch quality is a rather broad concept, and research has suggested that there is no easy way to definitively describe patch quality [13][24]. Another threat regarding patch quality is that we labeled a generated patch as "LowQ" relative to its competing "HighQ" patch in this study. We plan to work on a more general solution by establishing a quality baseline (for example, the buggy location as in the control group) and distinguishing the quality of auto-generated patches according to this baseline.

To avoid repeated testing [20], we did not invite domain experts who were familiar with the buggy code, since it would be hard to judge whether their good performance is attributable to debugging aids or to their a priori knowledge of the corresponding buggy code. Instead, we recruited participants who had little or no domain knowledge of the code. This is not an unrealistic setting, since in practice developers are often required to work with unfamiliar code due to limited human resources or deadline pressure [9][19][25]. However, our findings may not generalize to project developers who are sufficiently knowledgeable about the code. Future studies are required to explore how developers make use of auto-generated patches to debug their familiar code.

Our manual evaluation of the participants' patches might be biased. We mitigate this threat by considering patches verified by the original project developers as the ground truth. Three human evaluators individually evaluated the patches for two iterations and reached a consensus, with a Fleiss' kappa of 0.956 (Section 2.4).
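Since Fleiss' kappa may be unfamiliar, the short sketch below (with a hypothetical rating matrix, not our evaluation data) shows how it can be computed with Python's statsmodels:

    # A minimal sketch with a hypothetical rating matrix (not our evaluation
    # data): each row is one submitted patch, each column one evaluator, and
    # the value is the assigned label (0 = incorrect, 1 = correct).
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    ratings = np.array([
        [1, 1, 1],
        [0, 0, 0],
        [1, 1, 0],
        [0, 0, 0],
        [1, 1, 1],
    ])

    # aggregate_raters turns per-rater labels into a (subjects x categories)
    # count table, which fleiss_kappa expects.
    table, _ = aggregate_raters(ratings)
    print(fleiss_kappa(table, method="fleiss"))  # values near 1 = strong agreement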

Our measurement of debugging time might be inaccurate for Engr and MTurk participants, since the online system cannot remotely monitor participants when they are away from the keyboard during debugging. To mitigate this threat, we asked these participants to self-report the time they had spent on each task, and found no significant difference between their estimated time and our recorded time.

One may argue that participants might have blindly reused the auto-generated patches instead of truly fixing bugs on their own. We took several preventive measures to discourage such behavior. First, we emphasized in the instructions that the provided patches may or may not really fix the bugs, and that participants should make their own judgement. Second, we claimed to hold back additional test cases of our own, so that participants could not relax even if they passed all the tests by reusing the generated patches. We also required participants to justify their patches during submission. As discussed in Section 3, the participants aided by generated patches spent a similar amount of debugging time to the Location group. This indicates that participants were less likely to reuse generated patches directly, which would otherwise take only seconds to complete.

6 RELATED WORK
Automatic patch generation techniques have been actively studied. Most studies evaluated the quantitative aspects of proposed techniques, such as the number of candidates generated before a valid patch is found and the number of subjects for which patches can be generated [8][21][26][30][33][34]. Our study complements prior studies by qualitatively evaluating the usefulness of auto-generated patches as debugging aids.

Several studies have discussed the quality of generated patches. Kim et al. evaluated generated patches using human acceptability [17], which is adopted in our study to indicate patch quality. Fry et al. considered patch quality in terms of understandability and maintainability [13]. Their human study revealed that generated patches alone are less maintainable than human-written patches. However, when augmented with synthesized documentation, generated patches have comparable or even better maintainability than human-written patches [13]. Fry et al. also investigated code features (e.g., number of conditionals) that correlate with patch maintainability [13].

While patch quality can be measured from various perspectives, research has suggested that, instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help participants perform difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, while generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behaviors. In fact, contrary to the findings of Ceccato et al., we observed that the participants' debugging performance dropped when given low-quality generated patches.

7 CONCLUSION
We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discover that the participants do fix bugs more quickly and correctly using high-quality generated patches. However, when aided by low-quality generated patches, the participants' debugging correctness is compromised.

Based on the study results, we have learned important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required, even if they are used as debugging aids instead of being directly integrated into the production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize the overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs; for easy bugs, however, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from using generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable of the buggy code.

8 ACKNOWLEDGMENTS
We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9 REFERENCES
[1] Amazon Mechanical Turk. https://www.mturk.com/mturk.
[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.
[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE'09.
[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE'09.
[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE'11, 2011.
[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezze. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE'13.
[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE'12, pages 452–462.
[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE'09.
[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: An exploratory study. ICSE'12, pages 266–276.
[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. Software Engineering, IEEE Transactions on, 32(11):849–867, 2006.
[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[12] J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage, 1997.
[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA'12.
[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM'10.
[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE'10.
[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.
[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE'13.
[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC'03.
[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971–987, 2006.
[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.
[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. ICSE'12.
[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1–23, 2013.
[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241–256, 2006.
[24] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE'14.
[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: A controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT'06/FSE-14, pages 12–22, 2006.
[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: Program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE'13.
[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks - an empirical study with industry practitioners. In ICSM'11.
[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA'11.
[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP'09.
[30] Y. Qi, X. Mao, Y. Lei, and C. Wang. Using automated program repair for evaluating the effectiveness of fault localization techniques. ISSTA'13, 2013.
[31] N. Ramasubbu, M. Cataldo, R. K. Balan, and J. D. Herbsleb. Configuring global software teams: A multi-company analysis of project productivity, quality, and profits. ICSE'11, pages 261–270, 2011.
[32] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. CHI EA'10.
[33] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller. Automated fixing of programs with contracts. ISSTA'10.
[34] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE'09.

  • Introduction
  • Experimental Process
    • Bug Selection and Debugging Aids
    • Participants
    • Study Design
    • Evaluation Measures
      • Results
        • Debugging Aids
        • Participant Types
        • Bug Difficulty
        • Programming Experience
        • Regression Analysis
        • Summary
          • Qualitative Analysis
          • Threats to Validity
          • Related Work
          • Conclusion
          • Acknowledgments
          • References
Page 9: Automatically Generated Patches as Debugging Aids: A Human ...people.csail.mit.edu/hunkim/papers/tao-fse2014.pdf · Debugging is of paramount importance in the process of software

Table 7 Participantsrsquo positive and negative opinions on using auto-generated patches as debugging aids All of these sentencesare the participantsrsquo original feedback We organize them into different categories with the summary written in italic

Positive Negative

It provides a quick starting point

P1 It usually provides a good starting point (HighQ)P2 It did point to the general area of code (LowQ)P3 I have followed the patch to identify the exact placewhere the issue occurred (LowQ)

It can be confusing misleading or incompleteN1 The generated patch was confusing (LowQ)N2 For developers with less experience it can give them thewrong lead (LowQ)N3 Itrsquos not a complete solution The patch tested for two nullreferences however one of the two variables was dereferencedabout 1-2 lines above the faulting line Therefore there shouldhave been two null reference tests (HighQ)

It simplifies the problem

P4 I feel that the hints were quite good and sim-plified the problem (HighQ)P5 The hint provided the solution to the problem thatwould have taken quite a long time to determine bymanually (HighQ)P6 This saves time when it comes to bugs and memorymonitoring (HighQ)

It may over-complicate the problem or over-simplify itby not addressing the root causeN4 Sometimes they might over-complicate the problem or notaddress the root of the issue (HighQ)N5 The patch did not appear to catch the issue and insteadlooked like it deleted the block of code (LowQ)N6Generated patches are only useful when the machine doesnot do it randomly eg it cannot just guess that somethingmay fix it and if it does that that is good (LowQ)

It helps brainstorming

P7 They would seem to be useful in helping findvarious ideas around fixing the issue even if the patchisnrsquot always correct on its own (LowQ)

It may not be helpful for unfamiliar code

N7 I prefer to fix the bug manually with a deeper un-derstanding of the program (HighQ)N8 A context to the problem may have helped here (LowQ)

It recognizes easy problems

P8 They would be good at recognizing obviousproblems (eg potential NPE) (HighQ)

It may not recognize complicated problems

N9 They would be difficult to recognize more involveddefects (HighQ)

It makes tests pass

P9 The patch made the test case pass (LowQ)

It may not work if the test suite itself is insufficientN10 Assuming the test cases are sufficient the patch willwork (HighQ)N11 Irsquom worried that other test cases will fail (LowQ)

It provides extra diagnostic information

P10 Any additional information regarding a bugis almost always helpful (HighQ)

It cannot replace standard debugging tools

N12 I would use them as debugging aids along withaccess to standard debugging tools (LowQ)

Interestingly we also observe several opposite opinionsParticipants think that generated patches may be good atrecognizing easy bugs (P8) but may not work well for com-plicated ones (N9) While generated patches can quicklypass test cases (P9) they work only if test cases are sufficient(N10-N11) Finally participants use generated patches tobrainstorm (P7) and acquire additional diagnostic informa-tion (P10) However they still consider generated patchesdispensable especially when powerful debugging tools areavailable (N12)

In general participants are positive about auto-generatedpatches as they provide quick hints about either buggy ar-eas or even fix solutions On the other hand participantsare less confident about the capability of generated patchesfor handling complicated bugs and addressing root causesespecially when they are unfamiliar with code and test cases

5 THREATS TO VALIDITYThe five bugs and ten generated patches we used to con-

struct debugging tasks may not be representative of all bugsand generated patches We mitigate this threat by selectingreal bugs with varied symptoms and defect types and using

patches that were generated by two state-of-the-art programrepair techniques [17][34] Nonetheless further studies withwider coverage of bugs and auto-generated patches are re-quired to strengthen the generalizability of our findings

We measured patch quality using human-perceived ac-ceptability which may not generalize to other measurementssuch as metric-based ones [27] However patch quality isa rather broad concept and research has suggested thatthere is no easy way to definitively describe patch qual-ity [13][24] Another threat regarding patch quality is thatwe labeled a generated patch as ldquoLowQrdquo relative to its com-peting ldquoHighQrdquo patch in this study We plan to work on amore general solution by establishing a quality baseline forexample the buggy location as in the control group and dis-tinguishing the quality of auto-generated patches accordingto this baseline

To avoid repeated testing [20] we did not invite domainexperts who were familiar with the buggy code since itwould be hard to judge whether their good performance isattributable to debugging aids or their a priori knowledge ofthe corresponding buggy code Instead we recruited partic-ipants who had little or no domain knowledge of the codeThis is not an unrealistic setting since in practice devel-

opers are often required to work with unfamiliar code dueto limited human resources or deadline pressure [9][19][25]However our findings may not generalize to project devel-opers who are sufficiently knowledgeable about the codeFuture studies are required to explore how developers makeuse of auto-generated patches to debug their familiar code

Our manual evaluation of the participantsrsquo patches mightbe biased We mitigate this threat by considering patchesverified by the original project developers as the groundtruth Three human evaluators individually evaluated thepatches for two iterations and reached a consensus with theFleiss kappa equals to 0956 (Section 24)

Our measuring of debugging time might be inaccuratefor Engr and MTurk participants since the online systemcannot remotely monitor participants when they were awayfrom keyboard during debugging To mitigate this threatwe asked these participants to self-report the time they hadspent on each task and found no significant difference be-tween their estimated time and our recorded time

One may argue that participants might have blindly reusedthe auto-generated patches instead of truly fixing bugs ontheir own We took several preventive measures to discour-age such behaviors First we emphasized in the instructionsthat the provided patches may or may not really fix the bugsand participants should make their own judgement Secondwe claimed to preserve additional test cases by ourselves sothat participants may not relax even if they passed all thetests by reusing the generated patches We also requiredparticipants to justify their patches during submissions Asdiscussed in Section 3 the participants aided by generatedpatches spent a similar amount of debugging time to theLocation group This indicates that participants were lesslikely to reuse generated patches directly which would oth-erwise take only seconds to complete

6 RELATED WORKAutomatic patch generation techniques have been actively

studied Most studies evaluated the quantitative aspects ofproposed techniques such as the number of candidates gen-erated before a valid patch is found and the number of sub-jects for which patches can be generated [8][21][26][30][33][34] Our study complements prior studies by qualitativelyevaluating the usefulness of auto-generated patches as de-bugging aids

Several studies have discussed the quality of generatedpatches Kim et al evaluated generated patches using hu-man acceptability [17] which is adopted in our study to in-dicate patch quality Fry et al considered patch quality asits understandability and maintainability [13] Their humanstudy revealed that generated patches alone are less main-tainable than human-written patches However when aug-mented with synthesized documentation generated patcheshave comparable or even better maintainability than human-written patches [13] Fry et al also investigated code fea-tures (eg number of conditionals) that correlate to patchmaintainability [13]

While patch quality can be measured from various per-spectives research has suggested that instead of using a uni-versal definition the measurement of patch quality shoulddepend on the application [13][24][27] Monperrus suggestedthat generated patches should be understandable by humanswhen used in recommendation systems [24] We used essen-tially the same experimental setting in which participants

were recommended but not forced to use generated patchesfor debugging Our results consistently show that patch un-derstandability affects the recommendation outcome

Automated support for debugging is another line of re-lated work Parnin and Orso conducted a controlled ex-periment requiring human subjects to debug with or with-out the support of an automatic fault localization tool [28]Contrary to their finding that the tool did not help performdifficult tasks we found that generated patches tend to bemore beneficial for fixing difficult bugs

Ceccato et al conducted controlled experiments in whichhumans performed debugging tasks using manually writtenor automatically generated test cases [7] They discoveredthat despite the lack of readability human debugging per-formance improved significantly with auto-generated testcases However generated test cases are essentially dif-ferent from the generated patches used in our study Forinstance generated test cases include sequences of methodcalls that may expose buggy program behaviors while gen-erated patches directly suggest bug fixes Such differencesmight cause profound deviations in human debugging be-haviors In fact contrary to the findings of Ceccato et alwe observed that the participantsrsquo debugging performancedropped when given low-quality generated patches

7 CONCLUSIONWe conducted a large-scale human study to investigate the

usefulness of automatically generated patches as debuggingaids We discover that the participants do fix bugs morequickly and correctly using high-quality generated patchesHowever when aided by low-quality generated patches theparticipantsrsquo debugging correctness is compromised

Based on the study results we have learned importantaspects to consider when using auto-generated patches indebugging First of all strict quality control of generatedpatches is required even if they are used as debugging aidsinstead of direct integration into the production code Oth-erwise low-quality generated patches may negatively affectdevelopersrsquo debugging capabilities which might in turn jeop-ardize the overall software quality Second to maximize theusefulness of generated patches proper tasks should be se-lected For instance generated patches can come in handywhen fixing difficult bugs For easy bugs however simpleinformation such as buggy locations might suffice Alsousersrsquo development experience matters as our result showsthat novices with less programming experience tend to ben-efit more from using generated patches

To corroborate these findings we need to extend our studyto cover more diverse bug types and generated patches Wealso need to investigate the usefulness of generated patchesin debugging with a wider population such as domain ex-perts who are sufficiently knowledgeable of the buggy code

8 ACKNOWLEDGMENTSWe thank all the participants for their help with the hu-

man study We thank Fuxiang Chen and Gyehyung Jeon fortheir help with the result analysis This work was funded byResearch Grants Council (General Research Fund 613911) ofHong Kong National High-tech RampD 863 Program (2012AA-011205) and National Natural Science Foundation (6110003891318301 61321491 61361120097) of China

9 REFERENCES[1] Amazon mechanical turk

httpswwwmturkcommturk

[2] J Anvik L Hiew and G C Murphy Coping with anopen bug repository In Proceedings of the 2005OOPSLA workshop on Eclipse technology eXchange

[3] J Aranda and G Venolia The secret life of bugsGoing past the errors and omissions in softwarerepositories ICSErsquo09

[4] C Bird N Nagappan P Devanbu H Gall andB Murphy Does distributed development affectsoftware quality an empirical case study of windowsvista ICSE rsquo09

[5] C Bird N Nagappan B Murphy H Gall andP Devanbu Donrsquot touch my code Examining theeffects of ownership on software quality ESECFSErsquo11 2011

[6] A Carzaniga A Gorla A Mattavelli N Perino andM Pezze Automatic recovery from runtime failuresIn Proceedings of the 2013 International Conferenceon Software Engineering ICSE rsquo13

[7] M Ceccato A Marchetto L Mariani C Nguyenand P Tonella An empirical study about theeffectiveness of debugging when random test cases areused In ICSErsquo12 pages 452ndash462

[8] V Dallmeier A Zeller and B Meyer Generatingfixes from object behavior anomalies In Proceedings ofthe 2009 IEEEACM International Conference onAutomated Software Engineering ASErsquo09

[9] E Duala-Ekoko and M P Robillard Asking andanswering questions about unfamiliar APIs anexploratory study ICSErsquo12 pages 266ndash276

[10] J Duraes and H Madeira Emulation of softwarefaults A field data study and a practical approachSoftware Engineering IEEE Transactions on32(11)849ndash867 2006

[11] J L Fleiss Measuring nominal scale agreement amongmany raters Psychological bulletin 76(5)378 1971

[12] J Fox Applied regression analysis linear models andrelated methods Sage 1997

[13] Z P Fry B Landau and W Weimer A human studyof patch maintainability In Proceedings of the 2012International Symposium on Software Testing andAnalysis ISSTArsquo12

[14] Z P Fry and W Weimer A human study of faultlocalization accuracy ICSMrsquo10

[15] Z Gu E T Barr D J Hamilton and Z Su Has thebug really been fixed ICSErsquo10

[16] JNielsen Time budgets for usability sessionshttpwwwuseitcomalertboxusability sessionshtml

[17] D Kim J Nam J Song and S Kim Automaticpatch generation learned from human-written patchesIn Proceedings of the 2013 International Conferenceon Software Engineering ICSErsquo13

[18] A Ko and B Uttl Individual differences in programcomprehension strategies in unfamiliar programmingsystems In IWPCrsquo03

[19] A J Ko B A Myers M J Coblenz and H HAung An Exploratory Study of How Developers SeekRelate and Collect Relevant Information duringSoftware Maintenance Tasks IEEE Transactions on

Software Engineering 32971ndash987 2006

[20] J Lazar J H Feng and H Hochheiser ResearchMethods in Human-Computer Interaction John WileySons 2010

[21] C Le Goues M Dewey-Vogt S Forrest andW Weimer A systematic study of automated programrepair fixing 55 out of 105 bugs for $8 each ICSErsquo12

[22] C Le Goues S Forrest and W Weimer Currentchallenges in automatic software repair SoftwareQuality Journal pages 1ndash23 2013

[23] P Madhavan D A Wiegmann and F C LacsonAutomation failures on tasks easily performed byoperators undermine trust in automated aids HumanFactors The Journal of the Human Factors andErgonomics Society 48(2)241ndash256 2006

[24] M Monperrus A critical review of Sautomatic patchgeneration learned from human-written patchesT Anessay on the problem statement and the evaluation ofautomatic software repair ICSErsquo14

[25] T H Ng S C Cheung W K Chan and Y T YuWork experience versus refactoring to design patternsa controlled experiment In Proceedings of the 14thACM SIGSOFT international symposium onFoundations of software engineering SIGSOFTrsquo06FSE-14 pages 12ndash22 2006

[26] H D T Nguyen D Qi A Roychoudhury andS Chandra SemFix program repair via semanticanalysis In Proceedings of the 2013 InternationalConference on Software Engineering ICSErsquo13

[27] K Nishizono S Morisaki R Vivanco andK Matsumoto Source code comprehension strategiesand metrics to predict comprehension effort insoftware maintenance and evolution tasks - anempirical study with industry practitioners InICSMrsquo11

[28] C Parnin and A Orso Are automated debuggingtechniques actually helping programmers InProceedings of the 2011 International Symposium onSoftware Testing and Analysis ISSTArsquo11

[29] J H Perkins S Kim S Larsen S AmarasingheJ Bachrach M Carbin C Pacheco F SherwoodS Sidiroglou G Sullivan W-F Wong Y ZibinM D Ernst and M Rinard Automatically patchingerrors in deployed software SOSPrsquo09

[30] Y Qi X Mao Y Lei and C Wang Using automatedprogram repair for evaluating the effectiveness of faultlocalization techniques ISSTA rsquo13 2013

[31] N Ramasubbu M Cataldo R K Balan and J DHerbsleb Configuring global software teams Amulti-company analysis of project productivityquality and profits ICSE rsquo11 pages 261ndash270 2011

[32] J Ross L Irani M S Silberman A Zaldivar andB Tomlinson Who are the crowdworkers shiftingdemographics in mechanical turk CHI EA rsquo10

[33] Y Wei Y Pei C A Furia L S Silva S BuchholzB Meyer and A Zeller Automated fixing ofprograms with contracts ISSTA rsquo10

[34] W Weimer T Nguyen C Le Goues and S ForrestAutomatically finding patches using geneticprogramming ICSE rsquo09

  • Introduction
  • Experimental Process
    • Bug Selection and Debugging Aids
    • Participants
    • Study Design
    • Evaluation Measures
      • Results
        • Debugging Aids
        • Participant Types
        • Bug Difficulty
        • Programming Experience
        • Regression Analysis
        • Summary
          • Qualitative Analysis
          • Threats to Validity
          • Related Work
          • Conclusion
          • Acknowledgments
          • References
Page 10: Automatically Generated Patches as Debugging Aids: A Human ...people.csail.mit.edu/hunkim/papers/tao-fse2014.pdf · Debugging is of paramount importance in the process of software

opers are often required to work with unfamiliar code dueto limited human resources or deadline pressure [9][19][25]However our findings may not generalize to project devel-opers who are sufficiently knowledgeable about the codeFuture studies are required to explore how developers makeuse of auto-generated patches to debug their familiar code

Our manual evaluation of the participantsrsquo patches mightbe biased We mitigate this threat by considering patchesverified by the original project developers as the groundtruth Three human evaluators individually evaluated thepatches for two iterations and reached a consensus with theFleiss kappa equals to 0956 (Section 24)

Our measuring of debugging time might be inaccuratefor Engr and MTurk participants since the online systemcannot remotely monitor participants when they were awayfrom keyboard during debugging To mitigate this threatwe asked these participants to self-report the time they hadspent on each task and found no significant difference be-tween their estimated time and our recorded time

One may argue that participants might have blindly reusedthe auto-generated patches instead of truly fixing bugs ontheir own We took several preventive measures to discour-age such behaviors First we emphasized in the instructionsthat the provided patches may or may not really fix the bugsand participants should make their own judgement Secondwe claimed to preserve additional test cases by ourselves sothat participants may not relax even if they passed all thetests by reusing the generated patches We also requiredparticipants to justify their patches during submissions Asdiscussed in Section 3 the participants aided by generatedpatches spent a similar amount of debugging time to theLocation group This indicates that participants were lesslikely to reuse generated patches directly which would oth-erwise take only seconds to complete

6 RELATED WORKAutomatic patch generation techniques have been actively

studied Most studies evaluated the quantitative aspects ofproposed techniques such as the number of candidates gen-erated before a valid patch is found and the number of sub-jects for which patches can be generated [8][21][26][30][33][34] Our study complements prior studies by qualitativelyevaluating the usefulness of auto-generated patches as de-bugging aids

Several studies have discussed the quality of generatedpatches Kim et al evaluated generated patches using hu-man acceptability [17] which is adopted in our study to in-dicate patch quality Fry et al considered patch quality asits understandability and maintainability [13] Their humanstudy revealed that generated patches alone are less main-tainable than human-written patches However when aug-mented with synthesized documentation generated patcheshave comparable or even better maintainability than human-written patches [13] Fry et al also investigated code fea-tures (eg number of conditionals) that correlate to patchmaintainability [13]

While patch quality can be measured from various perspectives, research has suggested that, instead of using a universal definition, the measurement of patch quality should depend on the application [13][24][27]. Monperrus suggested that generated patches should be understandable by humans when used in recommendation systems [24]. We used essentially the same experimental setting, in which participants were recommended, but not forced, to use generated patches for debugging. Our results consistently show that patch understandability affects the recommendation outcome.

Automated support for debugging is another line of related work. Parnin and Orso conducted a controlled experiment requiring human subjects to debug with or without the support of an automatic fault localization tool [28]. Contrary to their finding that the tool did not help participants perform difficult tasks, we found that generated patches tend to be more beneficial for fixing difficult bugs.

Ceccato et al. conducted controlled experiments in which humans performed debugging tasks using manually written or automatically generated test cases [7]. They discovered that, despite the lack of readability, human debugging performance improved significantly with auto-generated test cases. However, generated test cases are essentially different from the generated patches used in our study. For instance, generated test cases include sequences of method calls that may expose buggy program behaviors, whereas generated patches directly suggest bug fixes. Such differences might cause profound deviations in human debugging behaviors. In fact, contrary to the findings of Ceccato et al., we observed that the participants' debugging performance dropped when given low-quality generated patches.

7. CONCLUSION

We conducted a large-scale human study to investigate the usefulness of automatically generated patches as debugging aids. We discover that participants do fix bugs more quickly and correctly using high-quality generated patches. However, when aided by low-quality generated patches, the participants' debugging correctness is compromised.

Based on the study results, we have identified important aspects to consider when using auto-generated patches in debugging. First of all, strict quality control of generated patches is required even if they are used as debugging aids rather than integrated directly into production code. Otherwise, low-quality generated patches may negatively affect developers' debugging capabilities, which might in turn jeopardize overall software quality. Second, to maximize the usefulness of generated patches, proper tasks should be selected. For instance, generated patches can come in handy when fixing difficult bugs; for easy bugs, however, simple information such as buggy locations might suffice. Also, users' development experience matters, as our results show that novices with less programming experience tend to benefit more from generated patches.

To corroborate these findings, we need to extend our study to cover more diverse bug types and generated patches. We also need to investigate the usefulness of generated patches in debugging with a wider population, such as domain experts who are sufficiently knowledgeable about the buggy code.

8. ACKNOWLEDGMENTS

We thank all the participants for their help with the human study. We thank Fuxiang Chen and Gyehyung Jeon for their help with the result analysis. This work was funded by the Research Grants Council (General Research Fund 613911) of Hong Kong, the National High-tech R&D 863 Program (2012AA011205), and the National Natural Science Foundation (61100038, 91318301, 61321491, 61361120097) of China.

9. REFERENCES

[1] Amazon Mechanical Turk. https://www.mturk.com/mturk.

[2] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange.

[3] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE '09.

[4] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE '09.

[5] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code! Examining the effects of ownership on software quality. ESEC/FSE '11, 2011.

[6] A. Carzaniga, A. Gorla, A. Mattavelli, N. Perino, and M. Pezzè. Automatic recovery from runtime failures. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.

[7] M. Ceccato, A. Marchetto, L. Mariani, C. Nguyen, and P. Tonella. An empirical study about the effectiveness of debugging when random test cases are used. In ICSE '12, pages 452–462.

[8] V. Dallmeier, A. Zeller, and B. Meyer. Generating fixes from object behavior anomalies. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE '09.

[9] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar APIs: an exploratory study. ICSE '12, pages 266–276.

[10] J. Duraes and H. Madeira. Emulation of software faults: A field data study and a practical approach. IEEE Transactions on Software Engineering, 32(11):849–867, 2006.

[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.

[12] J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage, 1997.

[13] Z. P. Fry, B. Landau, and W. Weimer. A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA '12.

[14] Z. P. Fry and W. Weimer. A human study of fault localization accuracy. ICSM '10.

[15] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? ICSE '10.

[16] J. Nielsen. Time budgets for usability sessions. http://www.useit.com/alertbox/usability_sessions.html.

[17] D. Kim, J. Nam, J. Song, and S. Kim. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.

[18] A. Ko and B. Uttl. Individual differences in program comprehension strategies in unfamiliar programming systems. In IWPC '03.

[19] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32:971–987, 2006.

[20] J. Lazar, J. H. Feng, and H. Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.

[21] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. ICSE '12.

[22] C. Le Goues, S. Forrest, and W. Weimer. Current challenges in automatic software repair. Software Quality Journal, pages 1–23, 2013.

[23] P. Madhavan, D. A. Wiegmann, and F. C. Lacson. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):241–256, 2006.

[24] M. Monperrus. A critical review of "automatic patch generation learned from human-written patches": An essay on the problem statement and the evaluation of automatic software repair. ICSE '14.

[25] T. H. Ng, S. C. Cheung, W. K. Chan, and Y. T. Yu. Work experience versus refactoring to design patterns: a controlled experiment. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 12–22, 2006.

[26] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13.

[27] K. Nishizono, S. Morisaki, R. Vivanco, and K. Matsumoto. Source code comprehension strategies and metrics to predict comprehension effort in software maintenance and evolution tasks - an empirical study with industry practitioners. In ICSM '11.

[28] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11.

[29] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, W.-F. Wong, Y. Zibin, M. D. Ernst, and M. Rinard. Automatically patching errors in deployed software. SOSP '09.

[30] Y. Qi, X. Mao, Y. Lei, and C. Wang. Using automated program repair for evaluating the effectiveness of fault localization techniques. ISSTA '13, 2013.

[31] N. Ramasubbu, M. Cataldo, R. K. Balan, and J. D. Herbsleb. Configuring global software teams: A multi-company analysis of project productivity, quality, and profits. ICSE '11, pages 261–270, 2011.

[32] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. CHI EA '10.

[33] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller. Automated fixing of programs with contracts. ISSTA '10.

[34] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE '09.
