Mining and Analysis of Control Structure Variant Clones Guo Qiao


Mining and Analysis of Control Structure Variant Clones

Guo Qiao

Outline

• Clones and Control Structure Variant Clones
• Research Motivation
• Approach for mining control structure variant clones
• Evaluation of precision and recall
• Case study of control structure variant clones
• Refactorability evaluation

2

• Clones are common in software systems. The percentage of clones in systems varied from 6.5% to 59.5%, with an average of 14.6% (Chen et al., 2014).

Code duplication (Software Clone)

3

Concordia University

• Clones are harmful
• Identified as the worst code smell (Rahman, 2010)
• Indication of poor software maintainability (Mondal, 2011)
• Cause system design quality to degrade

Why are clones a problem?

Clone refactoring can eliminate bad effects.

4

• Type-1: Identical code fragments except for variations in whitespace, layout and comments. (Clear)

• Type-2: Syntactically identical fragments except for variations in identifiers, literals, types, whitespace, layout and comments. (Clear)

• Type-3: Copied fragments with further modifications such as changed, added or removed statements, in addition to Type-1 variations.

• Type-4: Two or more code fragments that perform the same computation but are implemented by different syntactic variants.

Clone Categorization

The most widely accepted definition is from Roy (2009)

5

• Type-4 clones can be divided into subcategories.

Dispute about Type-4 Clones

• Type-4 clones are syntactically different semantic clones, and detecting them is still undecidable.
• Type-4 clones are code fragments that are behaviorally similar with respect to their input/output.

6

Definition

• Control structure variant clones (CSVC) are clones that use different control structures to implement the same functionality.
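As an illustration of the definition, the following minimal sketch shows two methods that compute the same sum, one with an enhanced for loop and one with an iterator-based while loop. The class and method names are hypothetical and are not taken from the studied systems.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class CsvcExample {
    // Variant A: enhanced for loop.
    static int sumEnhancedFor(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Variant B: iterator-based while loop with the same behavior.
    static int sumIteratorWhile(List<Integer> values) {
        int sum = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println(sumEnhancedFor(data));   // prints 10
        System.out.println(sumIteratorWhile(data)); // prints 10
    }
}
```

The two bodies share almost no syntax around the loop header, which is why a token- or tree-based detector may miss the pair.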

Control Structure Variant Clone?

7

From the perspective of clone refactoring, a different strategy is required to refactor control structure variant clones.

• Extract common code fragment
• Analysis of code functionality

Motivation

8

Jürgens et al. [2010], in a study of clones beyond copy-paste, revealed:

– The state-of-the-art clone detectors did not achieve a recall of more than 10%.

– In 52 manually checked methods, 32 were behaviorally similar but syntactically different to other methods.

No existing approach is tailored to finding these clones.

Motivation

9

Propose an approach to mine control structure variant clones accurately.

The mining process should take into account:
1. Control structure matching
2. Functional similarity evaluation

Goal

10

Overall Approach

[Figure: code example and its control dependency tree]

Phase 1: Control Structure Matching

12

• Loop variants
– Enhanced for loop
– Iterator-based for or while loop
– Index-based for or while loop
– Do-while loop

• Conditional variants
– If-else statement
– Conditional expression (ternary operator ?:)
– Switch statement
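The three conditional variants listed above can express the same decision. The sketch below is a hypothetical example (the method names are made up for illustration) in which all three return the same label for any input.

```java
class ConditionalVariants {
    // If-else statement.
    static String signIfElse(int n) {
        if (n < 0) {
            return "negative";
        } else {
            return "non-negative";
        }
    }

    // Conditional expression (ternary operator).
    static String signTernary(int n) {
        return n < 0 ? "negative" : "non-negative";
    }

    // Switch statement over the sign of n (-1, 0, or 1).
    static String signSwitch(int n) {
        switch (Integer.signum(n)) {
            case -1:
                return "negative";
            default:
                return "non-negative";
        }
    }
}
```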

Common Control Structures in Java

13

Loop Variable:
• Start index
• End index
• Step

We consider two loops L1 and L2 functionally equivalent if their loop variables take the same values.
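One way to picture the unified representation is a small value object holding the start index, end index, and step, with equivalence checked on the values the loop variable takes. This is only a sketch under stated assumptions: the class `LoopRange`, the exclusive end index, and the positive step are all hypothetical choices, not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class LoopRange {
    final int start; // first value of the loop variable
    final int end;   // exclusive upper bound (assumption)
    final int step;  // increment per iteration, must be positive (assumption)

    LoopRange(int start, int end, int step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        this.start = start;
        this.end = end;
        this.step = step;
    }

    // Enumerates the values the loop variable takes.
    List<Integer> values() {
        List<Integer> out = new ArrayList<>();
        for (int i = start; i < end; i += step) {
            out.add(i);
        }
        return out;
    }

    // Two loops are treated as equivalent if their loop variables
    // take exactly the same sequence of values.
    boolean equivalentTo(LoopRange other) {
        return values().equals(other.values());
    }
}
```

Under this sketch, `new LoopRange(0, 5, 2)` and `new LoopRange(0, 6, 2)` are equivalent because both visit 0, 2, 4, even though their end indices differ.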

Unified Representation of Loops

14

Control Structure Equivalents

15

• Start index

Control Structure Matching

16

• End index

Control Structure Matching

17

• Update Step

Control Structure Matching

18

Conditional Variant Equivalents

19

Java Binding: a unique string representing a variable, object type, or method invocation.

IBinding:
• IMethodBinding
• ITypeBinding
• IVariableBinding (Excluded)

Phase 2: Function Similarity Evaluation

20

IMethodBinding represents method signatures.
ITypeBinding represents the Java types.

Binding Information

21

1. All Collection subtypes are generalized to java.util.Collection.

Post-processing of Bindings

22

2. Ignore the binding keys of the methods which access the next element.
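The two post-processing steps could look roughly like the sketch below. It is an illustration only: plain qualified names stand in for the real binding keys (whose format the slides do not show), and the set `NEXT_ELEMENT_METHODS` is a hypothetical choice of which methods count as accessing the next element.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class BindingPostProcessor {
    // Hypothetical list of "next element" accessors to ignore.
    static final Set<String> NEXT_ELEMENT_METHODS = new HashSet<>(
            Arrays.asList("java.util.Iterator.next", "java.util.List.get"));

    // Step 1: map any Collection subtype to java.util.Collection.
    static String generalizeType(String qualifiedName) {
        try {
            // Load without initializing the class; only the type hierarchy is needed.
            Class<?> type = Class.forName(qualifiedName, false,
                    BindingPostProcessor.class.getClassLoader());
            return Collection.class.isAssignableFrom(type)
                    ? "java.util.Collection" : qualifiedName;
        } catch (ClassNotFoundException e) {
            return qualifiedName; // not a loadable type name, keep as-is
        }
    }

    // Step 2: drop next-element accessors, then generalize the remaining keys.
    static Set<String> postProcess(Set<String> keys) {
        Set<String> out = new LinkedHashSet<>();
        for (String key : keys) {
            if (NEXT_ELEMENT_METHODS.contains(key)) {
                continue;
            }
            out.add(generalizeType(key));
        }
        return out;
    }
}
```

After this step, a fragment using `java.util.ArrayList` and one using `java.util.HashSet` both contribute the same key, so the choice of collection type or iteration style does not lower their similarity.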

Post-processing of Bindings

23

• Jaccard Similarity Coefficient

• Specify the threshold Φ
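The Jaccard similarity coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below applies it to two sets of binding keys and compares the result with a threshold. The class name, the treatment of two empty sets as fully similar, and the inclusive comparison against Φ are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSimilarity {
    // Jaccard coefficient: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention assumed here for two empty sets
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // A pair is kept as a candidate when its similarity reaches the threshold.
    static boolean isCandidate(Set<String> a, Set<String> b, double threshold) {
        return jaccard(a, b) >= threshold;
    }
}
```

For example, the sets {x, y, z} and {y, z, w} share 2 of 4 distinct keys, giving a coefficient of 0.5.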

Quantify Functional Similarity

24

Study Setup
• Select projects.
• Select clone detection tool.
• Investigation of the results.

Evaluation

25

• Six open-source systems from different domains, varying in size and history.

Selection of Projects

26

• Three criteria for tool selection:
1. Able to detect clones with control structure variations.
2. Available for download.
3. Takes a reasonable amount of time to detect clones.

• Tried five different clone detection tools:
– CCFinder: not able to find semantic clones
– JSCtracker: not able to finish the detection process
– NiCad: returns abnormal clone groups
– Deckard: not able to finish detection
– Sebyte: works well for our experiment

Selection of Detection Tool

27

• Trade-off between precision and recall
• Identified 285 true positives (TP) and 475 false positives (FP)

Best Threshold

28

A threshold value of 0.5 achieved a precision of 0.64 and a recall of 0.91.
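The trade-off between precision and recall at a given threshold can be computed from manually labeled clone pairs as in the sketch below. The class, the method signature, and the sample data in the usage note are hypothetical; they do not reproduce the study's data.

```java
class ThresholdEvaluation {
    // scores[i] is the similarity of candidate pair i;
    // isTrueClone[i] is its manual label.
    // Returns {precision, recall} for the given threshold.
    static double[] precisionRecall(double[] scores, boolean[] isTrueClone, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean reported = scores[i] >= threshold;
            if (reported && isTrueClone[i]) {
                tp++;       // reported and correct
            } else if (reported) {
                fp++;       // reported but not a true clone
            } else if (isTrueClone[i]) {
                fn++;       // true clone filtered out by the threshold
            }
        }
        double precision = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        return new double[] {precision, recall};
    }
}
```

With the made-up scores {0.9, 0.6, 0.55, 0.3, 0.2} and labels {true, true, false, true, false}, lowering the threshold from 0.5 to 0.25 raises recall from 2/3 to 1.0 and changes precision from 2/3 to 0.75; on larger data sets, lowering the threshold typically trades precision for recall.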

Best Threshold

29

An average of 8.8 milliseconds per clone pair

Execution Time

30

Q1: Which variation occurs most frequently?

Q2: Does the evolution of a programming language affect the introduction of control structure variant clones?

Case Study

31

• Six different loop types yield 15 pairwise combinations (6 choose 2), 7 of which have instances.
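The count of 15 follows from pairing each loop type with every other type once. The sketch below enumerates the pairs; the six labels are an assumption, obtained by splitting the "for or while" variants listed earlier into separate types.

```java
import java.util.ArrayList;
import java.util.List;

class LoopCombinations {
    // Assumed six loop types (labels chosen for illustration).
    static final String[] LOOP_KINDS = {
        "enhanced for", "iterator-based for", "iterator-based while",
        "index-based for", "index-based while", "do-while"
    };

    // Lists every unordered pair of distinct loop types.
    static List<String> pairs(String[] kinds) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < kinds.length; i++) {
            for (int j = i + 1; j < kinds.length; j++) {
                out.add(kinds[i] + " vs. " + kinds[j]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs(LOOP_KINDS).size()); // prints 15
    }
}
```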

Case Study

32

Fact: The largest category is enhanced for loop vs. iterator-based while loop, which has 109 instances.

Answer to Q1: The enhanced for loop and the iterator-based while loop appear most often.

Case Study

33

Fact: The enhanced for loop is involved in all of the top 3 categories, which together contain 209 clone pairs and account for 73%.

Answer to Q2: The enhanced for loop, introduced in Java 5, significantly affects the introduction of control structure variant clones.

Case Study

34

State-of-the-art refactoring tool: JDeodorant

Clone Refactoring Evaluation

35

Initialization of arrays from collections

Variations Hindering Refactoring

36

Clone 1

Clone 2

Temporary variables

Variations Hindering Refactoring

37

Clone 1

Clone 2

Exchange of method invocation expressions

Variations Hindering Refactoring

38

Clone 1

Clone 2

A B

B A

Alternative branching statements

Variations Hindering Refactoring

39

Clone 1

Clone 2

Conclusion

• Control structure variant clones do exist in systems

• They are introduced because the language evolves, e.g., the new enhanced for loop feature

• 42% of the clones we found are refactorable

40

• Improve the approach to convert one data structure to another to refactor an additional 19% of the control structure variant clones.

Future Work

41

• Develop code to unify different control structures and perform the refactoring.

Thanks!

42

Visit our benchmark of control structure variant clones at http://users.encs.concordia.ca/~nikolaos/IWSC_2015/