1
Measuring Similarity of Large Software System Based on Source Code Correspondence
Tetsuo Yamamoto*, Makoto Matsushita**, Toshihiro Kamiya***, Katsuro Inoue**
*Ritsumeikan University, Japan
**Osaka University, Japan
***Japan Science and Technology Agency, Japan
2
Motivation
Long-lived software systems evolve through multiple modifications. Many different versions are created and delivered.
The evolution is not simple or straightforward.
It is common for one original system to spawn several distinct successor branches during its evolution.
Several distinct versions may later be unified and merged into another version.
To manage the many versions correctly and efficiently, it is very important to know their relationships objectively.
3
Motivation (Cont.)
We have been interested in measuring the similarity between two large software systems.
This was motivated by our scientific curiosity, such as what the quantitative similarity of two software systems is.
We would like to quantify the similarity with a solid and objective measure.
We have been interested in comparing all the files.
It is important that the software similarity metric is not based on sampled information such as attribute values (or fingerprints), but rather reflects the overall system characteristics.
4
Research Aim
We measure the similarity between two large software systems.
Propose a similarity metric, Sline.
Sline is defined as the ratio of shared source code lines to the total source code lines.
Sline requires computing matches between source code lines in the two systems, beyond the boundaries of files and directories.
Develop a similarity metric evaluation tool, SMAT (Software similarity MeAsurement Tool).
We have evaluated the similarity between various versions of BSD UNIX.
We have performed cluster analysis of the similarity values to create a dendrogram that correctly shows the evolution history of BSD UNIX.
5
Definitions
A software system P is composed of elements p1, p2, · · · , pm, and P is represented as a set {p1, p2, · · · , pm}.
Another software system Q is denoted by {q1, q2, · · · , qn}.
We will choose the type of elements, such as files or lines, based on the definitions of the similarity metrics.
6
Definitions (Cont.)
Suppose that we are able to determine a matching between pi and qj (1 <= i <= m, 1 <= j <= n). We call the set of matched pairs (pi, qj) the Correspondence Rs, where
\[ R_s \subseteq P \times Q \]
The similarity S of P and Q with respect to Rs is defined as follows:
\[ S(P, Q) = \frac{|\{p_i \mid (p_i, q_j) \in R_s\}| + |\{q_j \mid (p_i, q_j) \in R_s\}|}{|P| + |Q|} \]
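The definition of S can be checked on a toy case. The sketch below is only an illustration: the function name similarity and the element labels are made up for the example, and the correspondence is given by hand rather than computed.

```python
def similarity(p, q, rs):
    """S(P, Q): matched elements on both sides divided by |P| + |Q|.

    p and q are collections of elements; rs is a set of (p_i, q_j) pairs.
    """
    matched_p = {a for a, _ in rs}   # elements of P that appear in some pair
    matched_q = {b for _, b in rs}   # elements of Q that appear in some pair
    total = len(p) + len(q)
    if total == 0:
        return 0.0
    return (len(matched_p) + len(matched_q)) / total


# Hypothetical systems: three elements of P and two elements of Q are matched.
P = ["p1", "p2", "p3", "p4"]
Q = ["q1", "q2", "q3"]
Rs = {("p1", "q1"), ("p2", "q1"), ("p3", "q3")}
s = similarity(P, Q, Rs)   # (3 + 2) / (4 + 3)
```

Because each side counts distinct matched elements, a many-to-one match such as p1 and p2 both pairing with q1 counts q1 only once.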
7
Similarity Metric
We show a concrete operational similarity metric, Sline, using equivalent line matching.
Each element of a software system is a single line of each source file composing the system.
Two lines with minor distinctions such as space/comment modification and identifier renaming are recognized as equivalent.
Sline is not affected by file renaming or path changes.
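A minimal sketch of the space/comment part of this equivalence, assuming C-style comments. The helper normalize_lines is hypothetical and not part of SMAT. Identifier renaming is not handled here, and comment markers inside string literals are not treated specially.

```python
import re

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)   # /* ... */ comments
_LINE_COMMENT = re.compile(r"//[^\n]*")                # // ... comments


def normalize_lines(source):
    """Return non-empty lines with comments and all whitespace removed."""
    text = _BLOCK_COMMENT.sub(" ", source)
    text = _LINE_COMMENT.sub("", text)
    lines = []
    for line in text.splitlines():
        squeezed = "".join(line.split())   # drop every whitespace character
        if squeezed:                       # skip lines that became empty
            lines.append(squeezed)
    return lines


a = "int  sum = 0;   /* counter */\n\nreturn sum;  // done\n"
b = "int sum=0;\nreturn sum;\n"
same = normalize_lines(a) == normalize_lines(b)
```

Dropping all whitespace is a deliberately crude choice: it makes spacing changes invisible, at the cost of also merging adjacent tokens.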
8
Measuring Sline
A key problem of Sline is the computation of the correspondence Rs.
We propose an approach that effectively uses both diff and a clone detection tool named CCFinder [1].
CCFinder is a tool used to detect duplicated code blocks (called clones).
diff is a tool used to detect the longest common subsequence (LCS) between two files.
diff is applied to all pairs of files xi and yj for which CCFinder detects a clone pair (bx, by) such that bx is in xi and by is in yj.
[1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
9
Similarity Measuring Process
Step 1: All comments, white spaces, and empty lines are removed.
Step 2: CCFinder detects clones between the two systems X and Y. CCFinder has an option for the minimum number of tokens of clones to be detected, whose default is set to 20.
Step 3: SMAT executes diff on any file pair xi and yj in X and Y, respectively, where at least one clone is detected between xi and yj.
Step 4: The lines appearing in the clones detected by Step 2 and in the common subsequences found in Step 3 are merged into the correspondence.
Step 5: Sline is calculated using the ratio of lines in the correspondence to those in the whole systems.
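The process above can be sketched end to end under some loud assumptions. Clone pairs are passed in by hand as a stand-in for CCFinder output, in a made-up tuple format. Python's difflib.SequenceMatcher stands in for diff; it finds matching blocks heuristically and is not guaranteed to return a true LCS. Input lines are assumed to be preprocessed already. The function name sline is hypothetical.

```python
from difflib import SequenceMatcher


def sline(system_x, system_y, clone_pairs):
    """Estimate Sline for two systems given as {file name: list of lines}.

    clone_pairs: assumed format (file_x, start_x, file_y, start_y, length),
    with 0-based line offsets. This is not CCFinder's real output format.
    """
    matched_x, matched_y = set(), set()   # matched (file, line index) elements

    # Lines covered by the reported clones.
    for fx, sx, fy, sy, length in clone_pairs:
        matched_x.update((fx, sx + k) for k in range(length))
        matched_y.update((fy, sy + k) for k in range(length))

    # Compare only the file pairs that share at least one clone.
    for fx, fy in {(c[0], c[2]) for c in clone_pairs}:
        matcher = SequenceMatcher(None, system_x[fx], system_y[fy], autojunk=False)
        for block in matcher.get_matching_blocks():
            matched_x.update((fx, block.a + k) for k in range(block.size))
            matched_y.update((fy, block.b + k) for k in range(block.size))

    # Ratio of lines in the merged correspondence to all lines.
    total = sum(map(len, system_x.values())) + sum(map(len, system_y.values()))
    return (len(matched_x) + len(matched_y)) / total if total else 0.0


# Hypothetical two-file systems with one hand-written clone pair.
X = {"a.c": ["inta;", "intb;", "f(a);", "g(b);"], "b.c": ["voidh(){", "}"]}
Y = {"c.c": ["inta;", "intz;", "f(a);", "g(b);", "k();"], "d.c": ["voidh(){", "}"]}
clones = [("a.c", 2, "c.c", 2, 2)]
value = sline(X, Y, clones)
```

In this toy input, b.c and d.c are identical but contribute nothing, because no clone was reported between them and so they are never compared.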
10
Diff and CCFinder
A straightforward approach we might consider is to first construct appended files x1; x2; · · · and y1; y2; · · · , which are concatenations of all source files x1, x2, · · · and y1, y2, · · · for systems X and Y, respectively, and then compare the two appended files with diff.
This method is fragile because internal reshuffling of files changes the concatenation order.
Another approach is to greedily apply diff to all combinations of files between the two systems.
This approach might work, but scalability would be an issue.
When the length of a piece of code is less than the threshold of CCFinder (usually 20 tokens), CCFinder reports no clones at all.
An approach is proposed that effectively uses both diff and CCFinder.
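The fragility of the concatenation approach shows up even on a toy case. The sketch below uses a textbook dynamic-programming LCS length rather than diff itself, and the file names and contents are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists of lines."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Two hypothetical systems with identical files listed in a different order.
files = {"f1.c": ["a", "b", "c"], "f2.c": ["d", "e", "f"]}
order_x = ["f1.c", "f2.c"]
order_y = ["f2.c", "f1.c"]

concat_x = [line for name in order_x for line in files[name]]
concat_y = [line for name in order_y for line in files[name]]

shared_concat = lcs_length(concat_x, concat_y)                      # one file's worth
shared_per_file = sum(lcs_length(files[n], files[n]) for n in files)  # every line
```

The systems are identical, yet the LCS of the concatenations only recovers the lines of one file, because a common subsequence cannot cross the reordering.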
11
Applications of SMAT
To explore the applicability of Sline and SMAT, we have used many versions of the open-source BSD UNIX operating systems:
4.4-BSD Lite, 4.4-BSD Lite2
FreeBSD 2.0, 2.0.5, 2.1, 2.2, 3.0, 4.0
NetBSD 1.0, 1.1, 1.2, 1.3, 1.4, 1.5
OpenBSD 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8
23 major-release versions were chosen for computing Sline of all pair combinations.
The evaluation was performed only on source code files related to the OS kernels written in C.
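To get a feel for the size of the experiment, the number of version pairs can be counted directly. This is plain arithmetic on the version list above, assuming each unordered pair is measured once.

```python
from itertools import combinations

versions = (
    ["4.4-BSD Lite", "4.4-BSD Lite2"]
    + [f"FreeBSD {v}" for v in ("2.0", "2.0.5", "2.1", "2.2", "3.0", "4.0")]
    + [f"NetBSD {v}" for v in ("1.0", "1.1", "1.2", "1.3", "1.4", "1.5")]
    + [f"OpenBSD 2.{n}" for n in range(9)]
)

pairs = list(combinations(versions, 2))   # unordered pairs of distinct versions
print(len(versions), len(pairs))          # prints "23 253"
```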
12
13
Results (1/2)
Sline evolution between FreeBSD 2.2 and other FreeBSD versions
14
Results (2/2)
Sline between each version of FreeBSD and some of NetBSD
[Figure: Sline (vertical axis, 0 to 0.5) for FreeBSD 2.0, 2.0.5, 2.1, 2.2, 3.0, and 4.0 (horizontal axis), with one series each for NetBSD 1.0, 1.1, 1.2, and 1.3]
15
Cluster Analysis
The dendrogram from a cluster analysis is shown.
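How a dendrogram can be derived from pairwise similarity values is sketched below. The slides do not say which clustering method was used, so this assumes a distance of 1 - Sline and average linkage. The four systems and their similarity values are invented for illustration and are not measured results.

```python
from itertools import combinations


def cluster(names, sim):
    """Average-linkage agglomerative clustering; returns the tree as nested tuples."""
    def dist(a, b):
        return 1.0 - sim[frozenset((a, b))]   # assumed distance: 1 - similarity

    def avg(pair):
        (_, members_a), (_, members_b) = pair
        total = sum(dist(a, b) for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    clusters = [(name, (name,)) for name in names]   # (tree, member names)
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


# Hypothetical similarity values for four hypothetical systems.
sim = {
    frozenset(("A", "B")): 0.9,
    frozenset(("C", "D")): 0.8,
    frozenset(("A", "C")): 0.3,
    frozenset(("A", "D")): 0.2,
    frozenset(("B", "C")): 0.3,
    frozenset(("B", "D")): 0.25,
}
tree = cluster(["A", "B", "C", "D"], sim)
```

The nested tuples record the merge order, which is the information a dendrogram draws: the most similar pair is merged first, and the root joins the two least similar groups.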
16
Conclusion
We have proposed a similarity metric called Sline.
Sline is defined as the ratio of shared source code lines to the total source code lines.
We have developed an Sline-based evaluation tool, SMAT.
We have applied SMAT to various software systems.
Sline and SMAT are very useful for identifying the origin of the systems and characterizing their evolution.
17
Future work
Further applications of SMAT to various software systems and product lines will be made to investigate their evolution.
18
End
19
Sline and Release Duration
The release durations are calculated from the differences between OS release dates.
The Pearson's correlation coefficient between Sline values and release durations of FreeBSD versions is -0.973.
The Pearson's correlation coefficient between the size increases and the release durations is 0.528.
We think that Sline is a reasonable measure of release duration in this case.
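For reference, Pearson's correlation coefficient can be computed as in the sketch below. The input numbers are invented to show a perfectly negative and a perfectly positive relationship. They are not the FreeBSD measurements, which are not reproduced in these slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values: similarity falling as the gap between releases grows.
durations = [1, 2, 3]
similarities = [6, 4, 2]
r_negative = pearson(durations, similarities)
r_positive = pearson(durations, [2, 4, 6])
```

A value close to -1, such as the one reported above, means that similarity decreases almost linearly as the release duration increases.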
20
The number of files and LOC of BSD UNIX
21
Part of Sline values between BSD UNIX kernel files
22
Outline of CCFinder
CCFinder directly compares source code token by token and detects code clones.
Normalization of name space
Replacement of names defined by user
Removal of table initialization
Consideration of module delimiters
CCFinder can analyze a system on the scale of millions of lines in practical time.
23
CCFinder: Clone Detection Process
Source files -> Lexical analysis -> Token sequence -> Transformation -> Transformed token sequence -> Match detection -> Clones on transformed sequence -> Formatting -> Clone pairs
    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }
[Figure: the example traced through each stage. Lexical analysis turns the source into a token sequence; transformation replaces names and literals with the placeholder $ while keeping keywords and punctuation; match detection reports the clones found on the transformed sequence.]
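The idea behind the transformation and match detection stages can be sketched on the loop bodies of foo and goo. This is a heavy simplification and not CCFinder's actual rule set: the tokenizer is a small regular expression, KEYWORDS is an assumed subset of Java keywords, every other name and every literal becomes $, and difflib's longest-match search stands in for CCFinder's own matching.

```python
import re
from difflib import SequenceMatcher

KEYWORDS = {"static", "void", "int", "for", "if", "new", "throws", "return"}
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|\+\+|\+=|[^\s\w]')


def transform(source):
    """Tokenize Java-like source and replace names and literals with '$'."""
    out = []
    for tok in TOKEN.findall(source):
        if tok[0] == '"' or tok[0].isdigit():
            out.append("$")                      # string or number literal
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            out.append("$")                      # user-defined or library name
        else:
            out.append(tok)                      # keyword or punctuation
    return out


foo_body = '''
int sum = 0;
for (int i = 0; i < a.length; ++i)
    if (pat.match(a[i]))
        sum += Sample.parseNumber(pat.getParen(0));
System.out.println("sum = " + sum);
'''
goo_body = foo_body.replace("pat", "exp").replace("Sample.", "")

foo_tokens, goo_tokens = transform(foo_body), transform(goo_body)
matcher = SequenceMatcher(None, foo_tokens, goo_tokens, autojunk=False)
longest = matcher.find_longest_match(0, len(foo_tokens), 0, len(goo_tokens))
```

After the transformation, renaming pat to exp no longer matters, so the two bodies share a long run of identical tokens. The run stops where foo has the extra Sample. qualifier, a difference that this sketch does not normalize.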