AUTOMATED UNIT TEST GENERATION DURING SOFTWARE...

Preview:

Citation preview

AUTOMATED UNIT TEST GENERATION DURING

SOFTWARE DEVELOPMENT

José Miguel Rojas j.rojas@sheffield.ac.uk

Joint work with Gordon Fraser and Andrea Arcuri

A Controlled Experiment and Think-aloud Observations

ISSTA 2015

“Testing is a widespread validation approach in industry, but it is still largely ad hoc,

expensive, and unpredictably effective.”“Software Testing Research: Achievements, Challenges, Dreams,”

A. Bertolino. Future of Software Engineering. IEEE . 2007.

“Testing is a widespread validation approach in industry, but it is still largely ad hoc,

expensive, and unpredictably effective.”

“Test case generation has a strong impact on the effectiveness and efficiency of testing.”

“…one of the most active research topics in software testing for several decades, resulting in many different approaches and tools.”

“Software Testing Research: Achievements, Challenges, Dreams,” A. Bertolino. Future of Software Engineering. IEEE . 2007.

”An orchestrated survey of methodologies for automated software test case generation,” S. Anand, E. K. Burke, T. Y. Chen, J. Clark, M.B. Cohen, W. Grieskamp, M.

Harman, M.J. Harrold, P. McMinn. J. Systems and Software. Elsevier. 2013.

BACK IN ISSTA 2013…

“Does automated white-box test generation really help software testers?,” G. Fraser, M. Staats, P. McMinn, A. Arcuri and F. Padberg

BACK IN ISSTA 2013…

“Does automated white-box test generation really help software testers?,” G. Fraser, M. Staats, P. McMinn, A. Arcuri and F. Padberg

ARE UNIT TEST GENERATION TOOLS HELPFUL TO DEVELOPERS WHILE THEY ARE CODING?

CODE COVERAGE

CODE COVERAGE

TIME SPENT ONTESTING

CODE COVERAGE

TIME SPENT ONTESTING

IMPLEMENTATION QUALITY

CONTROLLED EXPERIMENT

CONTROLLED EXPERIMENT

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

Class TemplateGolden Implementation and Test Suite

CONTROLLED EXPERIMENT

Class Template Implementation and Test Suite

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

Class Template Implementation and Test Suite

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

Class Template Implementation and Test Suite

Manual

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

1 hourClass Template Implementation and Test Suite

Manual

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

1 hourClass Template Implementation and Test Suite

41

Manual

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

1 hourClass Template Implementation and Test Suite

41

Manual

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

1 hourClass Template Implementation and Test Suite

41 2

Manual

Golden Implementation and Test Suite

CONTROLLED EXPERIMENT

1 hourClass Template Implementation and Test Suite

41 42

Manual

Golden Implementation and Test Suite

DOES USING EVOSUITE DURING SOFTWARE DEVELOPMENT LEAD TO TEST SUITES WITH HIGHER CODE COVERAGE?

RQ 1

Bran

ch C

over

age

0%

20%

40%

60%

80%

100%

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

50%

26%

57%

39% 41%

83%

38%

63%

Assisted Manual

CODE COVERAGEparticipants’ test suites run on their own implementations

CODE COVERAGETi

mes

Cov

erag

e w

as c

heck

ed

0

2

4

6

8

10

Category AxisFilterIterator FixedOrderComparator ListPopulation PredicatedMap

6.4

4

9.6

65.3

5.9

1.9

9

Assisted Manual

Times coverage was checked

ListPopulation

Bran

ch C

over

age

(%)

0%

25%

50%

75%

100%

Time (min)0 10 20 30 40 50 60

ManualEvoSuiteAssisted

CODE COVERAGEparticipant’s test suites run on their own implementations

ListPopulation

Bran

ch C

over

age

(%)

0%

25%

50%

75%

100%

Time (min)0 10 20 30 40 50 60

ManualEvoSuiteAssisted

CODE COVERAGEparticipant’s test suites run on their own implementations

ListPopulation

Bran

ch C

over

age

(%)

0%

25%

50%

75%

100%

Time (min)0 10 20 30 40 50 60

ManualEvoSuiteAssisted

CODE COVERAGEparticipant’s test suites run on their own implementations

ListPopulation

Bran

ch C

over

age

(%)

0%

25%

50%

75%

100%

Time (min)0 10 20 30 40 50 60

ManualEvoSuiteAssisted

CODE COVERAGEparticipant’s test suites run on their own implementations

Bran

ch C

over

age

0%

20%

40%

60%

80%

100%

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

50%

28%21%

30%

42%37%35%

41%

Assisted Manual

CODE COVERAGEparticipants’ test suites run on golden implementations

FilterIterator

Bran

ch C

over

age

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

Assisted Manual EvoSuite-generated

CODE COVERAGEparticipant’s test suites run on golden implementations, over time

ListPopulation

Bran

ch C

over

age

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

FixedOrderComparator

Bran

ch C

over

age

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

PredicatedMapBr

anch

Cov

erag

e

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

FilterIterator

Bran

ch C

over

age

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

Assisted Manual EvoSuite-generated

CODE COVERAGEparticipant’s test suites run on golden implementations, over time

ListPopulation

Bran

ch C

over

age

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

FixedOrderComparator

Bran

ch C

over

age

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

PredicatedMapBr

anch

Cov

erag

e

0%25%50%75%

100%

Time (min)0 10 20 30 40 50 60

Coverage can be higher when using EvoSuite, depending on how the generated tests are used.

DOES USING EVOSUITE DURING SOFTWARE DEVELOPMENT LEAD TO

DEVELOPERS SPENDING MORE OR LESS TIME ON TESTING?

RQ 2

TESTING EFFORT

0

3

6

8

11

14

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

5.6

11

12.913.7

4.4

8.27.27.8

Assisted Manual

Number of test runs

TESTING EFFORT

0

5

10

16

21

26

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

14.3

25

15.8

20

7.7

12.6

9.3

18.5

Assisted Manual

Minutes spent on testing

TESTING EFFORT

0

5

10

16

21

26

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

14.3

25

15.8

20

7.7

12.6

9.3

18.5

Assisted Manual

Minutes spent on testing

Using EvoSuite reduces the time spent on testing.

DOES USING EVOSUITE DURING SOFTWARE DEVELOPMENT LEAD TO

SOFTWARE WITH FEWER BUGS?

RQ 3

IMPLEMENTATION QUALITYGolden test suites run on participants’ implementations

Num

ber o

f Fail

ures

+Err

ors

0

3

6

10

13

16

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

14.3

5.34.3

6.3

15.6

6.4

4.26.1

Assisted Manual

IMPLEMENTATION QUALITYGolden test suites run on participants’ implementations

Num

ber o

f Fail

ures

+Err

ors

0

3

6

10

13

16

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

14.3

5.34.3

6.3

15.6

6.4

4.26.1

Assisted Manual

Using EvoSuite during development did not lead to to better implementations.

DOES SPENDING MORE TIME WITH EVOSUITE AND ITS TESTS LEAD TO

BETTER IMPLEMENTATIONS?

RQ 4

PRODUCTIVITYTime spent with EvoSuite

Corr

elatio

n w

ith

num

ber o

f fail

ures

plus

err

ors

-0.50

-0.40

-0.30

-0.20

-0.10

0.00

0.10

0.20

0.30

0.40

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

-0.490.02

-0.22-0.29

0.350.32

-0.03

0.35

Number of runsTime spent on tests

PRODUCTIVITYTime spent with EvoSuite

Corr

elatio

n w

ith

num

ber o

f fail

ures

plus

err

ors

-0.50

-0.40

-0.30

-0.20

-0.10

0.00

0.10

0.20

0.30

0.40

FilterIterator FixedOrderComparator ListPopulation PredicatedMap

-0.490.02

-0.22-0.29

0.350.32

-0.03

0.35

Number of runsTime spent on tests

Implementation quality improves the more timedevelopers spend with EvoSuite-generated tests.

Using automated unit test generation does impact developers’ productivity,

Using automated unit test generation does impact developers’ productivity,

but…

…how to make the most outof unit test generation tools?

Using automated unit test generation does impact developers’ productivity,

but…

THINK ALOUD OBSERVATIONS

J. Hughes and S. Parkes, “Trends in the use of verbal protocol analysis in software engineering research,” Behaviour and Information Technology, vol. 22, no. 2, pp. 127–140, 2003.

K. A. Ericsson and H. A. Simon, Protocol Analysis: Verbal Reports as Data (revised edition). MIT Press, 1993.

THINK ALOUD OBSERVATIONS

J. Hughes and S. Parkes, “Trends in the use of verbal protocol analysis in software engineering research,” Behaviour and Information Technology, vol. 22, no. 2, pp. 127–140, 2003.

K. A. Ericsson and H. A. Simon, Protocol Analysis: Verbal Reports as Data (revised edition). MIT Press, 1993.

Subject

THINK ALOUD OBSERVATIONS

J. Hughes and S. Parkes, “Trends in the use of verbal protocol analysis in software engineering research,” Behaviour and Information Technology, vol. 22, no. 2, pp. 127–140, 2003.

K. A. Ericsson and H. A. Simon, Protocol Analysis: Verbal Reports as Data (revised edition). MIT Press, 1993.

Observer

Subject

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

THINK ALOUD OBSERVATIONS

2 hoursClass Template Implementation and Test Suite

Golden Implementation and Test Suite

5 41

RESULTS

Images: http://jozef89.deviantart.com/

3 6 2 2 5

15 20 23 7 10

94% | 92% 97% | 72% 94% | 95% 100% | 94% 100% | 83%

FixedOrderComparator ListPopulationFilterIterator PredicatedMap PredicatedMap

RESULTS

Images: http://jozef89.deviantart.com/

3 6 2 2 5

15 20 23 7 10

94% 97% 94% | 95% 100% 100%

FixedOrderComparator ListPopulationFilterIterator PredicatedMap PredicatedMap

RESULTS

Images: http://jozef89.deviantart.com/

3 6 2 2 5

15 20 23 7 10

94% 97% | 72% 94% 100% 100%

FixedOrderComparator ListPopulationFilterIterator PredicatedMap PredicatedMap

LESSONS LEARNED

LESSONS LEARNED• There are different approaches to testing and test generation tools

should be adaptable to them

LESSONS LEARNED• There are different approaches to testing and test generation tools

should be adaptable to them

• Developers’ behaviour is often not driven by code coverage

LESSONS LEARNED• There are different approaches to testing and test generation tools

should be adaptable to them

• Developers’ behaviour is often not driven by code coverage

• Readability of generated unit tests is paramount

LESSONS LEARNED• There are different approaches to testing and test generation tools

should be adaptable to them

• Developers’ behaviour is often not driven by code coverage

• Readability of generated unit tests is paramount

• Integration into development environments must be improved

LESSONS LEARNED• There are different approaches to testing and test generation tools

should be adaptable to them

• Developers’ behaviour is often not driven by code coverage

• Readability of generated unit tests is paramount

• Integration into development environments must be improved

• Education/Best practices: Developers do not know how tobest use automated test generation tools!

“Coverage is easy to assess because it isa number, while readability is a very non-

tangible property…

“Coverage is easy to assess because it isa number, while readability is a very non-

tangible property…

… What is readable to me may not bereadable to you. It is readable to me justbecause I spent the last hour and a half

doing this.”

“Coverage is easy to assess because it isa number, while readability is a very non-

tangible property…

… What is readable to me may not bereadable to you. It is readable to me justbecause I spent the last hour and a half

doing this.”—Participant 5

j.rojas@sheffield.ac.uk

www.evosuite.org/study-2014/

j.rojas@sheffield.ac.uk

Recommended