EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR
CLAIRE LE GOUES
SITE VISIT, FEBRUARY 7, 2013


Page 1:

EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR

CLAIRE LE GOUES

SITE VISIT

FEBRUARY 7, 2013

Page 2:

“Benchmarks set standards for innovation, and can encourage or stifle it.”

-Blackburn et al.

Page 3:

2009: 15 papers on automatic program repair*

2011: Dagstuhl seminar on self-repairing programs

2012: 30 papers on automatic program repair*

2013: dedicated program repair track at ICSE

*Manually reviewed the results of a search of the ACM Digital Library for “automatic program repair.”

AUTOMATIC PROGRAM REPAIR OVER TIME

Page 4:

Manually sift through bugtraq data.

Indicative example: Axis project for automatically repairing concurrency bugs

• 9 weeks of sifting to find 8 bugs to study.
• Direct quote from Charles Zhang, senior author, on the process: "it's very painful."

This makes it very difficult to compare against previous or related work, or to generate sufficiently large datasets.

CURRENT APPROACH

Page 5:

GOAL: HIGH-QUALITY EMPIRICAL EVALUATION

Page 6:

SUBGOAL: HIGH-QUALITY BENCHMARK SUITE

Page 7:

Indicative of important real-world bugs, found systematically in open-source programs.

Support a variety of research objectives.

• “Latitudinal” studies: many different types of bugs and programs.

• “Longitudinal” studies: many iterative bugs in one program.

Scientifically meaningful: passing test cases ⇒ repair.

Admit push-button, simple integration with tools like GenProg.

BENCHMARK REQUIREMENTS


Page 9:

http://genprog.cs.virginia.edu

Goal: a large set of important, reproducible bugs in non-trivial programs.

Approach: use historical data to approximate discovery and repair of bugs in the wild.

SYSTEMATIC BENCHMARK SELECTION

Page 10:

Indicative of important real-world bugs, found systematically in open-source programs:

• Add new programs to the set, with as wide a variety of types as possible (supports “latitudinal” studies).

Support a variety of research objectives:

• Allow studies of iterative bugs, development, and repair: generate a very large set of bugs (100) in one program (php) (supports “longitudinal” studies).

NEW BUGS, NEW PROGRAMS

Page 11:

Program    LOC        Tests   Bugs  Description
fbc        97,000     773     3     Language (legacy)
gmp        145,000    146     2     Multiple precision math
gzip       491,000    12      5     Data compression
libtiff    77,000     78      24    Image manipulation
lighttpd   62,000     295     9     Web server
php        1,046,000  11,995  100   Language (web)
python     407,000    355     11    Language (general)
wireshark  2,814,000  63      7     Network packet analyzer
valgrind   711,000    595     2     Simulator and debugger
vlc        522,000    17      ??    Media player
svn        629,000    1,748   ??    Source control
Total      7,001,000  16,077  163


Page 13:

They must exist.

• Sometimes, but not always, true (see: Jonathan Dorn)

TEST CASE CHALLENGES

Page 14:

Program    LOC        Tests   Bugs  Description
fbc        97,000     773     3     Language (legacy)
gmp        145,000    146     2     Multiple precision math
gzip       491,000    12      5     Data compression
libtiff    77,000     78      24    Image manipulation
lighttpd   62,000     295     9     Web server
php        1,046,000  11,995  100   Language (web)
python     407,000    355     11    Language (general)
wireshark  2,814,000  63      7     Network packet analyzer
valgrind   711,000    595     2     Simulator and debugger
Total      5,850,000  14,312  163

BENCHMARKS

Page 15:

They must exist.

• Sometimes, but not always, true (see: Jonathan Dorn)

They should be of high quality.

• This has been a challenge from day 0: nullhttpd.
• Lincoln Labs noticed it too: sort.
• In both cases, adding test cases led to better repairs.

TEST CASE CHALLENGES

Page 16:

They must exist.

• Sometimes, but not always, true (see: Jonathan Dorn)

They should be of high quality.

• This has been a challenge from day 0: nullhttpd.
• Lincoln Labs noticed it too: sort.
• In both cases, adding test cases led to better repairs.

They must be automated so they can be run one at a time, programmatically, from within another framework.

TEST CASE CHALLENGES

Page 17:

Need to be able to compile and run new variants programmatically.

Need to be able to run test cases one at a time.

• This is not simple, and it becomes increasingly tricky as we scale up to real-world systems.

• Much of the challenge is unrelated to the program in question, instead requiring highly technical knowledge of OS-level details.

PUSH-BUTTON INTEGRATION

Page 18:

Calling a process from within another process:

• system("run test 1"); ...; wait()

wait() returns the process exit status.

This is complex.

• Example: a system call can fail because the OS ran out of memory in creating the process, or because the process itself ran out of memory.

How do we tell the difference?

• Answer: bit masking

DIGRESSION ON WAIT()
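To make the bit masking concrete: POSIX packs the child's fate into one integer, with the terminating signal in the low bits and the exit code in the next byte, and the standard macros are exactly those bit masks. A minimal C sketch (the test command name ./run_test is a made-up placeholder):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(void) {
    /* Run one test; the command name is hypothetical. */
    int status = system("./run_test 1");

    if (status == -1) {
        /* The OS could not create the child process at all (e.g., it
         * ran out of memory while forking): there is no test verdict. */
        perror("system");
        return 1;
    }
    if (WIFEXITED(status)) {
        /* The test ran to completion; mask out its exit code. */
        printf("test exited with code %d\n", WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
        /* The test process itself died, e.g., killed when *it* ran out
         * of memory, or crashed with SIGSEGV; mask out the signal. */
        printf("test killed by signal %d\n", WTERMSIG(status));
    }
    return 0;
}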

Page 19:

Moral: integration is tricky, and lends itself to human mistakes.

Possibility 1: original programmers make mistakes in developing the test suite.

• Test cases can have bugs, too.

Possibility 2: we (GenProg devs/users) make mistakes in integration.

• A few old php test cases were not up to our standards: faulty bit-shift math for extracting the return-value components.

REAL-WORLD COMPLEXITY

Page 20:

We are interested in more and better benchmark design, with easy integration (without gnarly OS details).

• Virtual machines provide one approach.

Need a better definition of “high-quality test case” vs. “low-quality test case”:

• Can the empty program pass it?
• Can every program pass it?
• Can the “always crashes” program pass it?

INTEGRATION CONCERNS
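Those three questions can be operationalized with a straw-man check: run every test against trivially wrong programs and flag any test they pass. A minimal sketch, assuming a hypothetical ./run-test.sh harness that exits 0 on a passing test, plus hypothetical ./empty (does nothing) and ./crasher (aborts immediately) binaries:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define NUM_TESTS 100  /* hypothetical suite size */

/* Return 1 iff the harness reports that `program` passes test `id`. */
static int passes(const char *program, int id) {
    char cmd[128];
    snprintf(cmd, sizeof cmd, "./run-test.sh %s %d", program, id);
    int status = system(cmd);
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main(void) {
    for (int t = 1; t <= NUM_TESTS; t++)
        if (passes("./empty", t) || passes("./crasher", t))
            printf("test %d is low quality: a straw-man program passes it\n", t);
    return 0;
}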

Page 21:

Over the past year, we have conducted studies of representation and operators for automatic program repair:

• One-point crossover on patch representation.
• Non-uniform mutation operator selection (sketched below).
• Alternative fault localization framework.
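As a sketch of the second item: non-uniform selection simply draws INSERT, DELETE, or REPLACE with unequal weights. The weights below are illustrative placeholders, not the values from our studies:

#include <stdio.h>
#include <stdlib.h>

typedef enum { INSERT, DELETE, REPLACE } op_t;

/* Illustrative weights only; they must sum to 1.0. */
static const double weights[] = { 0.3, 0.5, 0.2 };

/* Draw one operator with probability proportional to its weight. */
static op_t pick_operator(void) {
    double r = (double)rand() / RAND_MAX;
    double acc = 0.0;
    for (int i = 0; i < 2; i++) {
        acc += weights[i];
        if (r < acc)
            return (op_t)i;
    }
    return REPLACE;  /* remaining probability mass */
}

int main(void) {
    int counts[3] = { 0, 0, 0 };
    for (int i = 0; i < 100000; i++)
        counts[pick_operator()]++;
    printf("INSERT=%d DELETE=%d REPLACE=%d\n",
           counts[INSERT], counts[DELETE], counts[REPLACE]);
    return 0;
}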

Results on the next slide incorporate “all the bells and whistles”:

• Improvements based on those large-scale studies.
• Manually confirmed quality of testing framework.

CURRENT REPAIR SUCCESS

Page 22:

CURRENT REPAIR SUCCESS

Program    Previous Results  Current Results
fbc        1/3               1/3
gmp        1/2               1/2
gzip       1/5               1/5
libtiff    17/24             17/24
lighttpd   5/9               5/9
php        28/44             55/100
python     1/11              2/11
wireshark  1/7               4/7
valgrind   ---               1/2
Total      55/105            87/163

Page 23:

TRANSITION

Page 24:

REPAIR TEMPLATES

CLAIRE LE GOUES

SHIRLEY PARK

DARPA SITE VISIT

FEBRUARY 7, 2013

Page 25:

BIO + CS INTERACTION


Page 26:

Immune response is equally fast for large and small animals.

• The human lung is 100x larger than the mouse lung, yet influenza infections are still found in ~8 hours.

• Successfully balances local search and global response.

Balance between generic and specialized T-cells:

• Rapid response to new pathogens vs. long-term memory of previous infections (cf. vaccines).

IMMUNOLOGY: T-CELLS


Page 27:

[Diagram: the repair loop. An INPUT variant is MUTATEd; each mutant's FITNESS is EVALUATEd, and the mutant is either ACCEPTed back into the loop or DISCARDed, until a repaired OUTPUT is produced.]
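In code, the loop in the diagram reads roughly as follows; every type and function here is a hypothetical placeholder standing in for GenProg's machinery, not its actual API:

#include <stdlib.h>

/* Stand-in for a candidate program variant. */
typedef struct { int distance_from_fix; } variant;

/* MUTATE: apply one random edit (toy stand-in). */
static variant *mutate(const variant *v) {
    variant *c = malloc(sizeof *c);
    c->distance_from_fix = v->distance_from_fix + (rand() % 3) - 1;
    return c;
}

/* EVALUATE FITNESS: stand-in for the weighted fraction of tests passed. */
static double fitness(const variant *v) {
    return 1.0 / (1.0 + abs(v->distance_from_fix));
}

/* A repair passes every test. */
static int is_repair(const variant *v) {
    return v->distance_from_fix == 0;
}

/* INPUT -> MUTATE -> EVALUATE FITNESS -> ACCEPT or DISCARD -> OUTPUT */
static variant *repair(variant *input, int budget) {
    variant *best = input;
    for (int i = 0; i < budget; i++) {
        variant *candidate = mutate(best);
        if (is_repair(candidate))
            return candidate;                 /* OUTPUT  */
        if (fitness(candidate) >= fitness(best))
            best = candidate;                 /* ACCEPT  */
        else
            free(candidate);                  /* DISCARD */
    }
    return NULL;  /* budget exhausted without a repair */
}

int main(void) {
    variant seed = { 5 };
    return repair(&seed, 10000) ? 0 : 1;
}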

Page 28:

Tradeoff between generic mutation actions and more specific action templates:

• Generic: INSERT, DELETE, REPLACE
• Specific:

if (X != NULL) { <code using X> }

AUTOMATIC SOFTWARE REPAIR

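As an illustration, instantiating the specific null-check template at a fault site might look like this; the function and its bug are made up:

#include <stdio.h>
#include <string.h>

/* Before repair, strlen(s) crashes when s is NULL. After repair, the
 * template if (X != NULL) { <code using X> } has been instantiated
 * with X = s, guarding the dereference. */
static size_t safe_length(const char *s) {
    size_t n = 0;
    if (s != NULL) {
        n = strlen(s);
    }
    return n;
}

int main(void) {
    printf("%zu %zu\n", safe_length("abc"), safe_length(NULL));  /* 3 0 */
    return 0;
}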

Page 29:


HYPOTHESIS: GENPROG CAN REPAIR MORE BUGS, AND REPAIR BUGS MORE QUICKLY, IF WE AUGMENT MUTATION ACTIONS WITH “REPAIR TEMPLATES.”

Page 30:


Insight: just as T-cells “remember” previous infections, we can abstract previous fixes to generate new mutations.

Approach:

• Model previous changes using structured documentation.
• Cluster a large set of changes by similarity.
• Abstract the center of each cluster.

Example:

if (X < 0)
    return 0;
else
    <code using X>

OPTION 1: PREVIOUS CHANGES

Page 31:


Insight: as with looking something up in a library, existing code provides the best examples of the behavior one wants to reproduce.

Approach:

• Generate static paths through C programs.
• Mine API usage patterns from those paths.
• Abstract the patterns into mutation templates.

Example:

while (it.hasNext())
    <code using it.next()>

OPTION 2: EXISTING BEHAVIOR
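The example above is Java-flavored; a C analogue of a mined API-usage pattern, with made-up file handling, might be "fopen is followed by a NULL check and a matching fclose":

#include <stdio.h>

int main(void) {
    FILE *f = fopen("input.txt", "r");  /* mined pattern: open ...   */
    if (f != NULL) {                    /* ... check for failure ... */
        int c;
        while ((c = fgetc(f)) != EOF)
            putchar(c);
        fclose(f);                      /* ... and paired close      */
    }
    return 0;
}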

Page 32:


THIS WORK IS ONGOING.

Page 33:


We are generating a benchmark suite to support GenProg research, integration and tech transfer, and the automatic repair community at large.

Current GenProg results for the 12-hour repair scenario: 87/163 (53%) of the real-world bugs in the dataset are repaired.

Repair templates will augment GenProg’s mutation operators to help repair more bugs, and repair bugs more quickly.

CONCLUSIONS