Page 1: Gemini: Code Clone Analysis Tool

Gemini: Code Clone Analysis Tool

Yasushi Ueda†, Yoshiki Higo‡, Toshihiro Kamiya*, Shinji Kusumoto‡, and Katsuro Inoue‡

†Graduate School of Engineering Science, Osaka Univ., Japan
‡Graduate School of Information Science and Technology, Osaka Univ., Japan
*PRESTO, Japan Science and Technology Corp., Japan

{y-ueda, y-higo, kamiya, kusumoto, inoue}@ist.osaka-u.ac.jp

Page 2: Gemini: Code Clone Analysis Tool

Contents

  • Background
  • Code Clone Analysis Tool, Gemini
      • Overview
      • System structure
      • Scatter Plot

Page 3: Gemini: Code Clone Analysis Tool

Background (1/2)

A code clone is a pair/set of code portions in source files that are identical or similar to each other.

Page 4: Gemini: Code Clone Analysis Tool

Background (2/2)

  • Code clones are one of the factors that make software maintenance more difficult.
      • If a fault is found in a code portion, the same fault must be corrected in all of its clones.
  • We have developed a code clone detection tool, CCFinder [1].
      • It is a token-based clone detector.
      • Its input is a set of source files, and its output is the locations of clone pairs.

[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, 28(7):654-670, 2002.

Page 5: Gemini: Code Clone Analysis Tool

CCFinder: Example of clone detection process

Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs

Source files (input):

     1. static void foo() throws RESyntaxException {
     2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
     3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
     4.   int sum = 0;
     5.   for (int i = 0; i < a.length; ++i)
     6.     if (pat.match(a[i]))
     7.       sum += Sample.parseNumber(pat.getParen(0));
     8.   System.out.println("sum = " + sum);
     9. }
    10. static void goo(String [] a) throws RESyntaxException {
    11.   RE exp = new RE("[0-9,]+");
    12.   int sum = 0;
    13.   for (int i = 0; i < a.length; ++i)
    14.     if (exp.match(a[i]))
    15.       sum += parseNumber(exp.getParen(0));
    16.   System.out.println("sum = " + sum);
    17. }

  • Lexical analysis: the source is split into a token sequence (static, void, foo, (, ), ...).
  • Transformation: the package-qualified name org.apache.regexp.RE is shortened to RE, the contents of the array initializer are collapsed to $, and identifiers are then replaced by $.
  • Match detection: matching sub-sequences are searched for on the transformed token sequence.
  • Formatting: the matches are mapped back to positions in the source files.

Clone pairs (output):

    0.1 3,1 9,1 11,1 17,1
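The lexical analysis and transformation stages can be sketched roughly as follows. This is a minimal illustration, not CCFinder's actual implementation: the class name `TokenNormalizer`, the tiny regular-expression lexer, and the list of words that are kept unchanged are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: tokenize Java-like text and replace identifiers,
// type names and literals with the placeholder "$".
final class TokenNormalizer {
    // Assumption: only these words survive the transformation unchanged.
    private static final Set<String> KEPT_WORDS = new HashSet<>(
        Arrays.asList("static", "for", "if", "new", "throws"));

    // Words, integer literals, string literals, "++", "+=", or any single symbol.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+|\"[^\"]*\"|\\+\\+|\\+=|\\S");

    // Lexical analysis: source text -> token sequence.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Transformation: token sequence -> transformed token sequence.
    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            char c = t.charAt(0);
            boolean word = Character.isLetter(c) || c == '_';
            boolean literal = Character.isDigit(c) || c == '"';
            boolean replace = literal || (word && !KEPT_WORDS.contains(t));
            out.add(replace ? "$" : t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Lines 6 and 14 of the example differ only in a variable name,
        // so both print the same transformed sequence.
        System.out.println(normalize(tokenize("if (pat.match(a[i]))")));
        System.out.println(normalize(tokenize("if (exp.match(a[i]))")));
    }
}
```

After this step, two fragments that differ only in names or literal values yield equal token lists, which is what allows match detection to treat them as clones.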

Page 6: Gemini: Code Clone Analysis Tool

Gemini overview

  • A GUI-based code clone analysis tool
      • Uses CCFinder as its code clone detector.
      • Provides several views for interactive analysis:
          • Scatter plot view: selection by mouse dragging, sorting function, zoom in/out
          • Metric graph view: selection by metric values
          • Source code view
  • Implemented in Java
      • About 10,000 lines of code

Page 7: Gemini: Code Clone Analysis Tool

Scatter plot

  • Both the vertical and horizontal axes represent the token sequence of the source code.
  • A dot means that the two corresponding tokens on the two axes are the same.
      • The main diagonal is always drawn, because each dot on it refers to the same position on both axes.
  • A clone pair appears as a diagonal line segment.
  • The distribution is symmetric about the main diagonal.

[Figure: scatter plot of a short token sequence (a, b, c, ...) plotted against itself; each dot marks a matched position.]
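The scatter plot idea can be sketched as a boolean matrix in which diagonal runs correspond to clone pairs. This is only an illustration of the concept, assuming a quadratic comparison of all token pairs; the class `ScatterPlot` and its methods are hypothetical and are not how Gemini or CCFinder compute the plot.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a token scatter plot and its diagonal segments.
final class ScatterPlot {
    // dots[i][j] is true when token i and token j are the same.
    static boolean[][] dots(String[] tokens) {
        int n = tokens.length;
        boolean[][] d = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                d[i][j] = tokens[i].equals(tokens[j]);
            }
        }
        return d;
    }

    // Returns {startI, startJ, length} for every maximal diagonal run above
    // the main diagonal whose length is at least minLen. The plot is
    // symmetric, so the lower triangle carries no extra information.
    static List<int[]> diagonalSegments(boolean[][] d, int minLen) {
        List<int[]> segments = new ArrayList<>();
        int n = d.length;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (!d[i][j]) continue;
                if (i > 0 && d[i - 1][j - 1]) continue; // not the start of a run
                int len = 0;
                while (j + len < n && d[i + len][j + len]) len++;
                if (len >= minLen) segments.add(new int[] {i, j, len});
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String[] tokens = "a b c a b c a d e c".split(" ");
        for (int[] s : diagonalSegments(dots(tokens), 2)) {
            System.out.println("i=" + s[0] + " j=" + s[1] + " length=" + s[2]);
        }
    }
}
```

The quadratic matrix is fine for a toy sequence; for real source code the matches would come from the clone detector rather than from pairwise comparison.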

Page 8: Gemini: Code Clone Analysis Tool

Sorting function

  • When multiple files are compared in the scatter plot, the boundaries between files are shown on the axes.
  • Depending on the order of the files, the dots may be scattered widely across the plot.
  • The sorting function places similar files as close to each other as possible.

Page 9: Gemini: Code Clone Analysis Tool

Snapshots of Gemini

Page 10: Gemini: Code Clone Analysis Tool

Conclusions

  • We presented Gemini, a maintenance support environment based on code clone analysis.
  • As future work, we are going to evaluate its applicability to large-scale software in actual maintenance.

Page 12: Gemini: Code Clone Analysis Tool

CCFinder: Implementation

  • CCFinder extracts code clones by direct comparison of source text.
  • It transforms the source text for precise and effective detection of code clones.
      • Token-based transformation rules regularize and select code portions, for Java, C++, COBOL, etc. programs.
  • It uses an efficient matching algorithm suited to large source code.
      • Complexity of the algorithm: O(n), where n is the length of the source code
      • Scalability: 108 min. for 7.2 million lines (Pentium III 650 MHz, 640 MB memory)

Page 13: Gemini: Code Clone Analysis Tool

The difference between ‘diff’ and clone detection tools

  • Diff finds the longest common subsequence of lines.
      • Given a code portion, diff does not report two or more identical code portions (clones).
  • A clone detection tool finds all identical or similar code portions.

Page 14: Gemini: Code Clone Analysis Tool

Example of transformation rules in Java

  • All user-defined identifiers are transformed to the same token.
  • A unique identifier is inserted at each end of the top-level definitions and declarations.
      • This prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
  • "java.lang.Math.PI" is transformed to "Math.PI".
      • With an import statement, a class can be referred to by either its full package name or a shorter name.
  • "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
      • This eliminates table initialization code.
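Two of these rules can be approximated on plain text with regular expressions. This is a rough sketch under stated assumptions: CCFinder applies its rules to token sequences, whereas the hypothetical class `TransformationRules` below works on strings and relies on the naming convention that package segments start with a lowercase letter and class names with an uppercase letter.

```java
// Hypothetical text-level approximation of two transformation rules.
final class TransformationRules {
    // Rule: "java.lang.Math.PI" -> "Math.PI".
    // Heuristic (assumption): a run of lowercase-initial segments followed
    // by an uppercase-initial name is a package qualifier.
    static String stripPackageQualifier(String code) {
        return code.replaceAll("\\b(?:[a-z_][a-z0-9_]*\\.)+(?=[A-Z])", "");
    }

    // Rule: "new int[] {1, 2, 3}" -> "new int[] {$}".
    // Only handles flat (non-nested) initializers of one-dimensional arrays.
    static String collapseInitializer(String code) {
        return code.replaceAll(
            "(new\\s+\\w+\\s*\\[\\s*\\]\\s*)\\{[^{}]*\\}", "$1{\\$}");
    }

    public static void main(String[] args) {
        System.out.println(stripPackageQualifier(
            "org.apache.regexp.RE pat = new org.apache.regexp.RE(\"[0-9,]+\");"));
        System.out.println(collapseInitializer(
            "String a[] = new String [] { \"123,400\", \"abc\" };"));
    }
}
```

The lookahead keeps member accesses such as `System.out.println` or `a.length` untouched, because the name after the dot starts with a lowercase letter.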

Page 15: Gemini: Code Clone Analysis Tool

Clone class metrics

  • LEN(C): Length of the token sequence of each element in clone class C
  • POP(C): Number of elements in clone class C
  • RAD(C): Distribution in the file system of the elements in clone class C
  • DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine

[Figure: code fragments replaced by caller statements that call a new subroutine.]

Page 16: Gemini: Code Clone Analysis Tool

Snapshots of clone class metric graph

[Figure: metric graph with the axes RAD, LEN, POP and DFL; filtering mode: ON.]

Page 17: Gemini: Code Clone Analysis Tool

Aims of clone class metrics

We are interested in:

  • Clone classes whose elements are spread widely.
      • A high value of POP means that there are many similar code fragments.
      • A high value of RAD means that the clones are spread over many subsystems.
      • Such clones are difficult to find all together during maintenance.
  • Clone classes that are appropriate for refactoring.
      • A high value of DFL (high POP and high LEN) means that it is worth evaluating whether the elements of the clone class can be merged into one routine.

Page 18: Gemini: Code Clone Analysis Tool

Definition of DFL and RAD

DFL(C)

    DFL(C) = LEN(C) × POP(C) - (5 × POP(C) + LEN(C))

  • LEN(C) × POP(C): the target code size for restructuring
  • 5 × POP(C): the code size of the new caller statements
  • LEN(C): the code size of the new identical routine

RAD(C): Distribution in the file system of the elements in clone class C

  • RAD(C) = 0: C is enclosed within a single file.
  • RAD(C) = 1: C is enclosed within a single directory.
  • RAD(C) = n: C is enclosed within a directory tree of n layers.

[Figure: code fragments replaced by caller statements that call a new subroutine.]
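A small sketch of how the two metrics might be computed. It assumes that DFL is the target code size minus the size of the code that replaces it (caller statements plus the new routine), and that "n layers" means the longest path, counted in path components, from the lowest common directory down to a file of the clone class. The class `CloneClassMetrics`, its method names, and the use of "/"-separated paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the DFL and RAD metrics for one clone class.
final class CloneClassMetrics {
    // Size of one caller statement in tokens, as given on the slide.
    static final int CALLER_TOKENS = 5;

    // Tokens removed = old code - (new caller statements + new routine).
    static int dfl(int len, int pop) {
        return len * pop - (CALLER_TOKENS * pop + len);
    }

    // filePaths: one "/"-separated path per code fragment of the clone class.
    static int rad(List<String> filePaths) {
        boolean singleFile = true;
        List<String[]> parts = new ArrayList<>();
        for (String p : filePaths) {
            singleFile &= p.equals(filePaths.get(0));
            parts.add(p.split("/"));
        }
        if (singleFile) return 0;

        // Length of the common directory prefix (never includes a file name).
        int common = 0;
        outer:
        while (true) {
            for (String[] s : parts) {
                if (common >= s.length - 1
                        || !s[common].equals(parts.get(0)[common])) {
                    break outer;
                }
            }
            common++;
        }

        // Assumption: RAD is the deepest file below the common directory.
        int max = 0;
        for (String[] s : parts) max = Math.max(max, s.length - common);
        return max;
    }

    public static void main(String[] args) {
        System.out.println(dfl(50, 4));
        System.out.println(rad(Arrays.asList("src/a/Foo.java", "src/b/c/Bar.java")));
    }
}
```

With these assumptions, a clone class is only worth merging when DFL is positive, which requires fragments noticeably longer than a caller statement.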

Page 19: Gemini: Code Clone Analysis Tool

CCFinder (3/4)

Application of CCFinder

  • Free software
      • JDK libraries (Java, 570 KLOC)
      • Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
      • FreeBSD, OpenBSD, NetBSD (C)
      • Qt (C++, 240 KLOC)
  • Commercial software
      • NTT data Corp., Hitachi Ltd., NEC soft Ltd., ASTEC Inc., SRA Inc.
      • NASDA (control program for rocket)

Page 20: Gemini: Code Clone Analysis Tool

CCFinder (4/4)

Output of CCFinder

    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:\Gemini.java
    0.1 94 C:\GeneralManager.java
    :
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    :
    #end{clone}

  • Object file ID (file 0 in group 0)
  • Location of a clone pair (lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other)

It is difficult to analyze source code using only this text-based information about the locations of clone pairs.
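To make the clone lines easier to work with, they can be read into simple objects. The sketch below is based only on the sample output shown on this slide: it assumes seven whitespace-separated fields per clone line, that the first number of each "a,b" pair is a line number (as the annotation suggests), and that the last field is a length. The class and field names are hypothetical, and the second number of each pair is ignored because its meaning is not stated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the #begin{clone} ... #end{clone} section.
final class CloneOutputParser {
    static final class Fragment {
        final String fileId;
        final int startLine;
        final int endLine;
        Fragment(String fileId, int startLine, int endLine) {
            this.fileId = fileId;
            this.startLine = startLine;
            this.endLine = endLine;
        }
    }

    static final class ClonePair {
        final Fragment left;
        final Fragment right;
        final int length; // assumption: last field is the clone length
        ClonePair(Fragment left, Fragment right, int length) {
            this.left = left;
            this.right = right;
            this.length = length;
        }
    }

    private static Fragment fragment(String id, String start, String end) {
        // Keep only the first number of "line,x"; x is not interpreted here.
        return new Fragment(id,
            Integer.parseInt(start.split(",")[0]),
            Integer.parseInt(end.split(",")[0]));
    }

    static ClonePair parseLine(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 7) {
            throw new IllegalArgumentException("unexpected clone line: " + line);
        }
        return new ClonePair(fragment(f[0], f[1], f[2]),
                             fragment(f[3], f[4], f[5]),
                             Integer.parseInt(f[6]));
    }

    // Collects clone pairs between the section markers; other lines are skipped.
    static List<ClonePair> parse(List<String> lines) {
        List<ClonePair> pairs = new ArrayList<>();
        boolean inClones = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("#begin{clone}")) inClones = true;
            else if (line.equals("#end{clone}")) inClones = false;
            else if (inClones && line.matches("\\d+\\.\\d+\\s.*")) pairs.add(parseLine(line));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<ClonePair> pairs = parse(Arrays.asList(
            "#begin{clone}",
            "0.1 53,9 63,13 1.10 542,9 553,13 35",
            ":",
            "#end{clone}"));
        ClonePair p = pairs.get(0);
        System.out.println(p.left.fileId + " lines " + p.left.startLine + "-" + p.left.endLine
            + " <-> " + p.right.fileId + " lines " + p.right.startLine + "-" + p.right.endLine);
    }
}
```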

Page 21: Gemini: Code Clone Analysis Tool

System structure of Gemini

[Figure: system structure of Gemini. Components shown: the source files; the code clone detector (CCFinder); the code clone database; the clone pair manager, metrics manager and source code manager; the user interfaces (scatter plot view, metric graph views, source code view), which exchange clone selection information; and the user.]

Page 22: Gemini: Code Clone Analysis Tool

Example of clone detection process

[Figure: the clone detection process of CCFinder, shown line by line for the 17-line Java example (methods foo and goo). Stages: source files, lexical analysis, token sequence, transformation, transformed token sequence, match detection, clones on transformed sequence, formatting, clone pairs. In the transformed token sequence, identifiers are shown as $p and the contents of the array initializer as $u. Reported clone pair: 0.1 3,1 9,1 11,1 17,1]

Page 23: Gemini: Code Clone Analysis Tool

Suffix-tree

A suffix tree is a tree that satisfies the following conditions.

  1. A leaf node represents the starting position of a sub-string.
  2. A path from the root node to a leaf node represents a sub-string.
  3. The first characters of the labels of all the edges leaving one node are different from each other.

→ A common path means a clone.

[Figure: suffix tree and match matrix for the example string "x x y x y z %" (positions 1 to 7).]
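The link between shared prefixes of suffixes and clones can be illustrated without building the tree. The sketch below deliberately swaps the suffix tree for a naive sorted list of suffixes, which is much slower on large inputs but finds the same repeated sub-strings; the class `RepeatedSubstring` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Naive stand-in for a suffix tree: sort all suffixes and compare neighbours.
final class RepeatedSubstring {
    // Returns the longest sub-string that occurs at least twice in s.
    static String longestRepeated(String s) {
        List<String> suffixes = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            suffixes.add(s.substring(i));
        }
        Collections.sort(suffixes);

        String best = "";
        for (int k = 1; k < suffixes.size(); k++) {
            String a = suffixes.get(k - 1);
            String b = suffixes.get(k);
            // A shared prefix of two suffixes is a "common path" in the tree.
            int n = 0;
            while (n < a.length() && n < b.length() && a.charAt(n) == b.charAt(n)) {
                n++;
            }
            if (n > best.length()) best = a.substring(0, n);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeated("xxyxyz%"));
    }
}
```

For the example string of the figure, the suffixes starting at positions 2 and 4 share the prefix "xy", which is the repeated portion a clone detector would report.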

Page 24: Gemini: Code Clone Analysis Tool

Case study overview

  • Application target
      • Programs developed in a programming exercise at Osaka Univ.
      • Compiler in C language
      • Programs of 69 students
      • Total size is 360,000 lines of code
  • Issue of analysis
      • Similarity among all programs
      • In the programming exercise, plagiarism sometimes occurs.

Page 25: Gemini: Code Clone Analysis Tool

Analysis (1/2)

  • The compilers of the 69 students are arranged on the two axes.
      • The distribution is spread widely.
  • The scatter plot is rearranged using the sorting function.
  • The grid represents the boundary lines between individual students.

Page 26: Gemini: Code Clone Analysis Tool

Analysis (2/2)

[Figure: scatter plot with two marked regions, A and B.]

The corresponding code:

  • A (2 students): The similar code fragments came from the source code of the sample compiler described in the textbook.
  • B (4 students): Many code fragments were similar even with respect to variable names and comments.

Page 27: Gemini: Code Clone Analysis Tool

Sorting function

  • RSA(i): Ratio of the code range in file i that is covered by clones between file i and the other files
  • RST(i, j): Ratio of the code range in file i that is covered by clones between file i and file j

Step 1: Select a head file by the value of RSA. (Make F the head file.)
Step 2: From among the remaining files, select the file most similar to F by the value of RST, and put it next to F.
Step 3: Repeat step 2 while any file remains, treating the file selected in the previous step 2 as the new F.

[Figure: example with six files f1 to f6; the order is built up step by step as f1, f6, f3, f4, followed by the remaining files f2 and f5.]
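The three steps amount to a greedy nearest-neighbour ordering, which can be sketched as follows. The slide does not say how the head file is chosen from RSA, so this sketch assumes the file with the largest RSA; it also assumes that "most similar to F" means the largest RST(F, j). The class `FileSorter` and its signature are hypothetical, and the RSA and RST values are taken as given inputs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the sorting function as a greedy ordering.
final class FileSorter {
    // rsa[i]    : RSA(i) for file i
    // rst[i][j] : RST(i, j) for files i and j
    // Returns the file indices in their new order on the axes.
    static List<Integer> order(double[] rsa, double[][] rst) {
        int n = rsa.length;
        List<Integer> result = new ArrayList<>();
        if (n == 0) return result;
        boolean[] used = new boolean[n];

        // Step 1 (assumption: the head file F is the one with the largest RSA).
        int f = 0;
        for (int i = 1; i < n; i++) {
            if (rsa[i] > rsa[f]) f = i;
        }
        used[f] = true;
        result.add(f);

        // Steps 2 and 3: append the remaining file most similar to F,
        // then treat that file as the new F.
        while (result.size() < n) {
            int next = -1;
            for (int j = 0; j < n; j++) {
                if (used[j]) continue;
                if (next < 0 || rst[f][j] > rst[f][next]) next = j;
            }
            used[next] = true;
            result.add(next);
            f = next;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] rsa = {0.9, 0.1, 0.5};
        double[][] rst = {
            {0.0, 0.2, 0.8},
            {0.2, 0.0, 0.3},
            {0.8, 0.3, 0.0}
        };
        System.out.println(order(rsa, rst));
    }
}
```

A greedy ordering like this does not guarantee the best overall arrangement, but it keeps each file next to the file it resembles most among those not yet placed.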