
1

Measuring Similarity of Large Software System Based on Source Code Correspondence

Tetsuo Yamamoto*, Makoto Matsushita**, Toshihiro Kamiya***, Katsuro Inoue**

*Ritsumeikan University, Japan
**Osaka University, Japan
***Japan Science and Technology Agency, Japan

2

Motivation

Long-lived software systems evolve through many modifications, and many different versions are created and delivered.

This evolution is neither simple nor straightforward. A single original system commonly spawns several distinct successor branches as it evolves, and several distinct versions may later be unified and merged into another version.

To manage the many versions correctly and efficiently, it is very important to know their relationships objectively.

3

Motivation (Cont.)

We have been interested in measuring the similarity between two large software systems. This was motivated by scientific curiosity, such as the question of what the quantitative similarity of two software systems is.

We would like to quantify the similarity with a solid and objective measure.

We have been interested in comparing all the files. It is important that the software similarity metric is not based on sampled information such as an attribute value (or fingerprint), but rather reflects the overall system characteristics.

4

Research Aim

We measure the similarity between two large software systems.

We propose a similarity metric, Sline. Sline is defined as the ratio of shared source code lines to the total source code lines. Computing Sline requires matching source code lines between the two systems, beyond the boundaries of files and directories.

We have developed a similarity metric evaluation tool, SMAT (Software similarity MeAsurement Tool). We have evaluated the similarity between various versions of BSD UNIX, and we have performed a cluster analysis of the similarity values to create a dendrogram that correctly shows the evolution history of BSD UNIX.

5

Definitions

A software system P is composed of elements p1, p2, ..., pm, and P is represented as the set {p1, p2, ..., pm}.

Another software system Q is denoted by {q1, q2, ..., qn}.

We choose the type of the elements, such as files or lines, based on the definition of each similarity metric.

6

Definitions (Cont.)

Suppose that we are able to determine the matching between pi and qj (1 <= i <= m, 1 <= j <= n). We call the set of matched pairs (pi, qj) the Correspondence Rs, where

$$R_s \subseteq P \times Q$$

The similarity S of P and Q with respect to Rs is defined as follows:

$$S(P, Q) = \frac{|\{p_i \mid (p_i, q_j) \in R_s\}| + |\{q_j \mid (p_i, q_j) \in R_s\}|}{|P| + |Q|}$$
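The definition can be checked on a small example. The following Python sketch is only an illustration and not part of SMAT; the function name `similarity` and the toy sets are assumptions, and the elements are arbitrary labels.

```python
def similarity(P, Q, Rs):
    """S(P, Q): share of the elements of P and Q that occur in a matched pair of Rs."""
    P, Q = set(P), set(Q)
    # Keep only pairs that really lie in P x Q.
    pairs = {(p, q) for (p, q) in Rs if p in P and q in Q}
    matched_p = {p for p, _ in pairs}   # elements of P that have a partner
    matched_q = {q for _, q in pairs}   # elements of Q that have a partner
    total = len(P) + len(Q)
    return (len(matched_p) + len(matched_q)) / total if total else 0.0


# Toy data (hypothetical): two elements of P both match q1.
P = {"p1", "p2", "p3"}
Q = {"q1", "q2"}
Rs = {("p1", "q1"), ("p2", "q1")}
print(similarity(P, Q, Rs))  # (2 + 1) / (3 + 2) = 0.6
```

Note that an element is counted once even if it appears in several pairs, so S always stays between 0 and 1.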

7

Similarity Metric

We show a concrete, operational similarity metric, Sline, that uses equivalent line matching.

Each element of a software system is a single line of a source file composing the system.

Two lines that differ only in minor ways, such as space or comment modifications and identifier renaming, are recognized as equivalent.

Sline is not affected by file renaming or path changes.
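One way to picture line equivalence is to map each line to a canonical form and compare the results. The Python sketch below is an assumption for illustration: the rules, the keyword list `C_KEYWORDS`, and the function `normalize_line` are hypothetical and do not describe how SMAT decides equivalence.

```python
import re

# Small, incomplete keyword list used only for this illustration.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while",
              "return", "static", "struct"}


def normalize_line(line):
    """Map a source line to a canonical form (hypothetical rules)."""
    line = re.sub(r"/\*.*?\*/", "", line)  # drop /* ... */ comments closed on the same line
    line = re.sub(r"//.*", "", line)       # drop // comments (string literals are not handled)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
    # Replace every non-keyword identifier by "$" so that renaming does not matter.
    return " ".join(
        "$" if re.match(r"[A-Za-z_]", t) and t not in C_KEYWORDS else t
        for t in tokens
    )


a = normalize_line("int  count = 0; /* counter */")
b = normalize_line("int total=0; // sum")
print(a)       # int $ = 0 ;
print(a == b)  # True
```

Under these rules the two lines differ only in spacing, comments, and an identifier name, so they map to the same canonical form.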

8

Measuring Sline

A key problem in measuring Sline is the computation of the correspondence Rs.

We propose an approach that effectively uses both diff and a clone detection tool named CCFinder [1].

CCFinder is a tool used to detect duplicated code blocks (called clones).

diff is a tool used to detect the longest common subsequence (LCS) between two files.

diff is applied to every pair of files xi and yj for which CCFinder detects a clone pair (bx, by) such that bx is in xi and by is in yj.

[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, 28(7):654-670, 2002.

9

Similarity Measuring Process

Step 1: All comments, white spaces, and empty lines are removed.

Step 2: CCFinder is run on the two systems X and Y to detect clones. CCFinder has an option for the minimum number of tokens in a detected clone, whose default is 20.

Step 3: SMAT executes diff on every file pair xi and yj, in X and Y respectively, for which at least one clone is detected between xi and yj.

Step 4: The lines appearing in the clones detected in Step 2 and in the common subsequences found in Step 3 are merged into the correspondence.

Step 5: Sline is calculated as the ratio of the lines in the correspondence to the lines in the whole systems.
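The process can be sketched end to end in Python. This is a rough illustration under stated assumptions, not SMAT itself: `clone_lines` is a naive line-based stand-in for CCFinder (which is token-based), Python's `difflib.SequenceMatcher` stands in for diff, the preprocessing only handles `//` comments, and all function names and the toy systems are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def preprocess(lines):
    """Step 1 (simplified): strip whitespace, drop // comments and empty lines."""
    out = []
    for line in lines:
        line = line.split("//")[0].strip()
        if line:
            out.append(line)
    return out


def clone_lines(a, b, min_run=2):
    """Step 2 stand-in: indices of lines in runs of >= min_run identical consecutive lines."""
    in_a, in_b = set(), set()
    for i in range(len(a) - min_run + 1):
        for j in range(len(b) - min_run + 1):
            if a[i:i + min_run] == b[j:j + min_run]:
                in_a.update(range(i, i + min_run))
                in_b.update(range(j, j + min_run))
    return in_a, in_b


def sline(X, Y, min_run=2):
    """X and Y map file names to lists of source lines."""
    X = {name: preprocess(lines) for name, lines in X.items()}
    Y = {name: preprocess(lines) for name, lines in Y.items()}
    matched_x, matched_y = set(), set()  # (file, line index) entries of the correspondence
    for (fx, a), (fy, b) in product(X.items(), Y.items()):
        ca, cb = clone_lines(a, b, min_run)
        if not ca:
            continue  # Step 3 runs only on file pairs that share a clone
        matched_x.update((fx, i) for i in ca)
        matched_y.update((fy, j) for j in cb)
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):  # Step 4: merge lines matched by the diff stand-in
                matched_x.add((fx, block.a + k))
                matched_y.add((fy, block.b + k))
    total = sum(map(len, X.values())) + sum(map(len, Y.values()))
    return (len(matched_x) + len(matched_y)) / total if total else 0.0  # Step 5


# Toy systems (hypothetical): a.c was moved and slightly edited in Y.
X = {"a.c": ["int x;", "int y;", "return x;"], "b.c": ["foo();"]}
Y = {"dir/a2.c": ["int x;", "", "int y;  // moved file", "return y;"], "c.c": ["bar();"]}
print(sline(X, Y))  # 4 matched lines out of 8 -> 0.5
```

Because matching is done per line and across all file pairs, the renamed path `dir/a2.c` does not affect the result.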

10

Diff and CCFinder

A straightforward approach we might consider is to first construct appended files x1; x2; ... and y1; y2; ..., which are the concatenations of all source files x1, x2, ... and y1, y2, ... of systems X and Y, respectively, and then to run diff on the two appended files.

This method is fragile because internal reshuffling of files changes the concatenation order.

Another approach is to apply diff to every combination of files between the two systems.

This approach might work, but scalability would be an issue.

Using CCFinder alone also has a limitation: when a piece of code is shorter than CCFinder's threshold (usually 20 tokens), CCFinder reports no clones at all.

We therefore propose an approach that effectively uses both diff and CCFinder.
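The fragility of the concatenation approach can be seen on a toy case. In the Python sketch below, `lcs_length` is a textbook dynamic-programming LCS written for this illustration (it is not the diff program), and the two "files" are made-up lists of line labels.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Two hypothetical files; system Y contains the same files in the opposite order.
f1 = ["a1", "a2", "a3"]
f2 = ["b1", "b2"]

concatenated = lcs_length(f1 + f2, f2 + f1)          # order-sensitive
per_file = lcs_length(f1, f1) + lcs_length(f2, f2)   # order-insensitive
print(concatenated, per_file)  # 3 5
```

Although both systems contain exactly the same five lines, the LCS of the concatenated files finds only three of them, because a common subsequence cannot cross the reordering. Comparing file pairs recovers all five, at the cost of examining up to m x n pairs, which is why restricting diff to pairs that share a clone is attractive.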

11

Applications of SMAT

To explore the applicability of Sline and SMAT, we have used many versions of the open-source BSD UNIX operating systems:

4.4-BSD Lite, 4.4-BSD Lite2
FreeBSD 2.0, 2.0.5, 2.1, 2.2, 3.0, 4.0
NetBSD 1.0, 1.1, 1.2, 1.3, 1.4, 1.5
OpenBSD 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8

These 23 major-release versions were chosen, and Sline was computed for all pair combinations.

The evaluation was performed only on the source code files of the OS kernels written in C.

12

13

Results (1/2)

Sline evolution between FreeBSD 2.2 and other FreeBSD versions

14

Results (2/2)

Sline between each version of FreeBSD and some versions of NetBSD

[Figure: Sline (vertical axis, 0 to 0.5) for FreeBSD 2.0, 2.0.5, 2.1, 2.2, 3.0, and 4.0 (horizontal axis), with one series each for NetBSD 1.0, 1.1, 1.2, and 1.3]

15

Cluster Analysis

The dendrogram obtained from the cluster analysis is shown.
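A dendrogram of this kind can be derived from pairwise similarity values by agglomerative clustering. The Python sketch below is an assumption for illustration: the slides do not state which clustering method was used, so average linkage with distance 1 - Sline is chosen here arbitrarily, and the system names and similarity values are made up (they are not measured results).

```python
from itertools import combinations


def cluster(names, sim):
    """Average-linkage agglomerative clustering; returns the merge history."""
    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1.0 - sim[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        # Merge the two closest clusters and record the distance at which they join.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, round(dist(c1, c2), 3)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical similarity values for four systems in two families.
names = ["A1", "A2", "B1", "B2"]
sim = {
    frozenset(("A1", "A2")): 0.9, frozenset(("B1", "B2")): 0.8,
    frozenset(("A1", "B1")): 0.3, frozenset(("A1", "B2")): 0.2,
    frozenset(("A2", "B1")): 0.3, frozenset(("A2", "B2")): 0.2,
}
for merge in cluster(names, sim):
    print(merge)
```

The merge history lists the closest pair first and the join of the two families last; drawing each merge at its distance gives the dendrogram.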

16

Conclusion

We have proposed a similarity metric called Sline. Sline is defined as the ratio of shared source code lines to the total source code lines.

We have developed an Sline-based evaluation tool, SMAT, and applied it to various software systems.

Sline and SMAT are very useful for identifying the origin of systems and for characterizing their evolution.

17

Future work

We will apply SMAT to further software systems and product lines to investigate their evolution.

18

End

19

Sline and Release Duration

The release durations are calculated from the differences between OS release dates.

Pearson's correlation coefficient between Sline values and release durations of FreeBSD versions is -0.973.

Pearson's correlation coefficient between the size increases and the release durations is 0.528.

We think that Sline is a reasonable measure of release duration in this case.
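For reference, Pearson's correlation coefficient can be computed as in the Python sketch below. The function name `pearson` and the numbers are assumptions for illustration; the data are made up and are not the FreeBSD measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values: longer gaps between releases, lower similarity.
durations = [1, 2, 3, 4]
similarities = [4, 3, 1, 0]
print(round(pearson(durations, similarities), 2))  # -0.99
```

A value close to -1, as reported above for Sline and release duration, means that similarity falls almost linearly as the time between releases grows.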

20

The number of files and LOC of BSD UNIX

21

Part of Sline values between BSD UNIX kernel files

22

Outline of CCFinder

CCFinder directly compares source code token by token and detects code clones. It applies the following:

Normalization of name space
Replacement of user-defined names
Removal of table initialization
Consideration of module delimiters

CCFinder can analyze systems on the scale of millions of lines within a practical amount of time.
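The idea of token-based comparison with name replacement can be sketched in a few lines of Python. This is a simplified illustration and not CCFinder's algorithm or interface: the tokenizer, the keyword list `KEYWORDS`, and the functions `transform` and `longest_common_run` are hypothetical, and only the replacement of user-defined names is modeled.

```python
import re

# Incomplete keyword list, sufficient for the snippets below.
KEYWORDS = {"static", "void", "int", "for", "if", "new", "throws"}


def transform(code):
    """Tokenize the code and replace user-defined names by '$'."""
    tokens = re.findall(r'"[^"]*"|[A-Za-z_]\w*|\d+|\+\+|\+=|\S', code)
    return ["$" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
            for t in tokens]


def longest_common_run(a, b):
    """Length of the longest run of consecutive equal tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Two fragments that differ only in one variable name (pat vs. exp).
foo = "int sum = 0; for (int i = 0; i < a.length; ++i) if (pat.match(a[i]))"
goo = "int sum = 0; for (int i = 0; i < a.length; ++i) if (exp.match(a[i]))"
a, b = transform(foo), transform(goo)
MIN_TOKENS = 20  # mirrors the default threshold mentioned for CCFinder
print(longest_common_run(a, b) >= MIN_TOKENS)  # True
```

After the names are replaced, the two fragments yield the same token sequence, so the shared run covers the whole fragment and exceeds the 20-token threshold.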

23

CCFinder: Clone Detection Process

Source files -> Lexical analysis -> Token sequence -> Transformation -> Transformed token sequence -> Match detection -> Clones on transformed sequence -> Formatting -> Clone pairs

Sample source:

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }

[Figure: token sequences of the sample at each stage. Lexical analysis splits the source into tokens; transformation drops the name-space prefixes and the table initializer and replaces user-defined names with $; match detection finds the matching runs in the transformed sequence]
