62
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University ICECCS 2017 Exploring Similar Code - From Code Clone Detection to Provenance Identification - Katsuro Inoue Osaka University 1

ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

ICECCS 2017Exploring Similar Code

- From Code Clone Detection to Provenance Identification -

Katsuro InoueOsaka University

1

Page 2: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

My Background

2

Ph.D. for Implementation & Optimization of Functional Programs

84

Software Process Modeling & Formalization

Program Slicing

Mining Soft. Repositories,Code Clone, Code Search

Code Provenance, License Analysis

’90 ’95 ’00 ’05 ’10 ’15 ’17

Formal Empirical

Page 3: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Welcome to ICECCS 2017The 22nd International Conference on Engineering of Complex Computer Systems (ICECCS 2017) will be held on November 6-8, 2017, Fukuoka, Japan. Complex computer systems are common in many sectors, such as manufacturing, communications, defense, transportation, aerospace, hazardous environments, energy, and health care. These systems are frequently distributed over heterogeneous networks, and are driven by many diverse requirements on performance, real-time behavior, fault tolerance, security, adaptability, development time and cost, long life concerns, and other areas. Such requirements frequently conflict, and their satisfaction therefore requires managing the trade-off among them during system development and throughout the entire system life. The goal of this conference is to bring together industrial, academic, and government experts, from a variety of user domains and software disciplines, to determine how the disciplines' problems and solution ... 3

Page 4: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Welcome to ICECCS 2017The 22nd International Conference on Engineering of Complex Computer Systems (ICECCS 2017) will be held on November 6-8, 2017, Fukuoka, Japan. Complex computer systems are common in many sectors, such as manufacturing, communications, defense, transportation, aerospace, hazardous environments, energy, and health care. These systems are frequently distributed over heterogeneous networks, and are driven by many diverse requirements on performance, real-time behavior, fault tolerance, security, adaptability, development time and cost, long life concerns, and other areas. Such requirements frequently conflict, and their satisfaction therefore requires managing the trade-off among them during system development and throughout the entire system life. The goal of this conference is to bring together industrial, academic, and government experts, from a variety of user domains and software disciplines, to determine how the disciplines' problems and solution ... 4

Average: 2.3% ICECCS: 6.8%

Page 5: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

5

Page 6: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

6

#include <stdlib.h>#include <stdio.h>#include <ctype.h>int main() {

int tot_chars = 0; /* total characters */int tot_lines = 0; /* total lines */ int tot_words = 0; /* total words */ int boolean;/* EOF == end of file */int n; while ((n = getchar()) != EOF) {

tot_chars++; if (isspace(n) && !isspace(getchar())) {

tot_words++; } if (n == '¥n') {

tot_lines++; } if (n == '-') {

tot_words--; }

} printf("Lines, Words, Characters¥n");printf(" %3d %3d %3d¥n", tot_lines, tot_words, tot_chars);return 0;

}

Page 7: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Similarity of Code Snippets

7

Page 8: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Program A Program B

Library C

∂∂∂

∂∂

∂∂

System X System Y System Z

∂ ∂

8

Page 9: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

9

Matters?

Page 10: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Program A Program B

Library C

∂∂∂

∂∂

∂∂

System X System Y System Z

∂ ∂

10

Page 11: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Program A11

Program A’

Program A’’

Page 12: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

12

Yes, code similarity matters!

Page 13: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

13

Our Code Clone Research

Page 14: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Code Clone• A code fragment with an identical or

similar code fragment• Created

– Accidentally: small popular idiomse.g., if(fp=fopen(“file”,”r”)==NULL){

– Intentionally: copy & paste practice

14

copy & paste

copy & paste

Page 15: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

15

Page 16: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Issue at Company X• Maintained code clones by hand

∂∂

∂∂∂ ∂

∂∂∂

∂∂

∂∂∂

20 years ?Hand-made clone list

Check

16

M-LOC system

Page 17: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

17

No tools available in ‘99

Page 18: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

18

Page 19: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Clone Detection Policies• Determine detection granularity

– char / token / line / block /function / ...• Cut off small clones (probably by non-

intentional idioms)• Find all possible clone sets at the

same time• Use efficient and scalable algorithm

19

Page 20: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Classification of Clones

• Type 1: Exact copy, only differences in white space and comments

• Type 2: Same as type 1, but also variable renaming

• Type 3: Same as type 2, but also deleting, adding, or changing few statements

• Type 4: Semantically identical, but not necessarily same syntax.

20

Page 21: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Example of Clones (type 2)

21

Page 22: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University22

A Clone Detection ProcessSource files

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

Clone pairs

1. static void foo() throws RESyntaxException {2. String a[] = new String [] { "123,400", "abc", "orange 100" };3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");4. int sum = 0;5. for (int i = 0; i < a.length; ++i)6. if (pat.match(a[i]))7. sum += Sample.parseNumber(pat.getParen(0));8. System.out.println("sum = " + sum);9. }

10. static void goo(String [] a) throws RESyntaxException {11. RE exp = new RE("[0-9,]+");12. int sum = 0;13. for (int i = 0; i < a.length; ++i)14. if (exp.match(a[i]))15. sum += parseNumber(exp.getParen(0));16. System.out.println("sum = " + sum);17. }

static void foo ( ) { String a

[ ] = new String [ ] { "123,400" ,

"abc" , "orange 100" } ;

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

sum

+= pat . getParen 0

; System . out . println ( "sum = "

+ sum ) ; }

throws RESyntaxException

Sample . parseNumber (

) )

if pat

. match a [ i ]( ) )

org . apache . regexp

. RE pat = new org . apache . regexp

. RE ( "[0-9,]+" ) ;

static void goo (

) {

String

a [ ]

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

System . out . println ( "sum = " + sum

) ; }

throws RESyntaxException

if exp

. match a [ i ]( ) )

exp =

new RE ( "[0-9,]+" ) ;

(

RE

sum

+= exp . getParen 0

;

parseNumber ( ) )(

(

(

[ ] = new String [ ] {

} ;

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

sum

+= pat . getParen 0

; System . out . println ( "sum = "

+ sum ) ; }

Sample . parseNumber (

) )

if pat

. match a [ i ]( ) )

pat = new

RE ( "[0-9,]+" ) ;

static void goo (

) {

String

a [ ]

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

System . out . println ( "sum = " + sum

) ; }

throws RESyntaxException

if exp

. match a [ i ]( ) )

exp =

new RE ( "[0-9,]+" ) ;

(

RE

sum

+= exp . getParen 0

;

parseNumber ( (

(

(

static void foo ( ) { String athrows RESyntaxException

$

RE

$ . ) )

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

[ ] = new String [ ] {

} ;

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

sum

+= pat . getParen 0

; System . out . println ( "sum = "

+ sum ) ; }

Sample . parseNumber (

) )

if pat

. match a [ i ]( ) )

pat = new

RE ( "[0-9,]+" ) ;

static void goo (

) {

String

a [ ]

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

System . out . println ( "sum = " + sum

) ; }

throws RESyntaxException

if exp

. match a [ i ]( ) )

exp =

new RE ( "[0-9,]+" ) ;

(

RE

sum

+= exp . getParen 0

;

parseNumber ( ) )(

(

(

static void foo ( ) { String athrows RESyntaxException

$

RE

$ .

[ ] = [ ] {

} ;

=

; for ( = ; <

. ; ++ )

+= .

; . . (

+ ) ; }

. (

) )

if

. [ ]( ) )

=

( ) ;

static (

) {[ ]

=

; ( = ; <

. ; ++ )

. . ( +

) ; }

throws

if

. [ ]( ) )

=

new ( ) ;

(

+= .

;

( ) )(

(

(

static $ ( ) {throws

$

$ .

$ $ $ $

$ $

$ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

new

forfor

new

[ ] = [ ] {

} ;

=

; for ( = ; <

. ; ++ )

+= .

; . . (

+ ) ; }

. (

) )

if

. [ ]( ) )

=

( ) ;

static (

) {[ ]

=

; ( = ; <

. ; ++ )

. . ( +

) ; }

throws

if

. [ ]( ) )

=

new ( ) ;

(

+= .

;

( ) )(

(

(

static $ ( ) {throws

$

$ .

$ $ $ $

$ $

$ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

1. static void foo() throws RESyntaxException {2. String a[] = new String [] { "123,400", "abc", "orange 100" };3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");4. int sum = 0;5. for (int i = 0; i < a.length; ++i)6. if (pat.match(a[i]))7. sum += Sample.parseNumber(pat.getParen(0));8. System.out.println("sum = " + sum);9. }

10. static void goo(String [] a) throws RESyntaxException {11. RE exp = new RE("[0-9,]+");12. int sum = 0;13. for (int i = 0; i < a.length; ++i)14. if (exp.match(a[i]))15. sum += parseNumber(exp.getParen(0));16. System.out.println("sum = " + sum);17. }

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

Page 23: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Match Detection Algorithms

23

x

y

z%

%

xyxyz%

y

xyz%

z%

xyz%

z%

1

2

43

56

71 2 3 4 5 6 7x x y x y z %

1 2 3 4 5 6 7x x y x y z %

1 x *2 x * *3 y *4 x * * *5 y * *6 z *7 % *

***

*

1 2 3 4 5 6 7x x y x y z %

Suffix Tree Matching Table

Page 24: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Initial Result• Initial prototype had been

implemented in a few weeks• Limited scalability

– Refined & tuned– Bought powerful workstation and expensive memory

24

Page 25: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Clones between Unix Kernels

25

FreeBSD

Linux

NetBSD

FreeBSD Linux NetBSD

Page 26: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Let’s Submit to IEEE TSE !CCFinder:amultilinguistictoken-basedcodeclonedetectionsystemforlargescalesourcecode,TKamiya,SKusumoto,KInoue,IEEETransactionsonSoftwareEngineering,Vol.28,No.7,pp.654-670,2002.

•Initial submission in July, 2000•Major revision request in Jan., 2001•Minor revision request in Aug., 2001•Accepted in Sept., 2001•Published in July, 2002

26

1,413 citations (2017, Nov. 3)22nd highly-cited paper in SE

Page 27: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Extending CCFinder• Prototype was extended, tuned, and

refined as a practical tool• Target languages

– Initially C– Adapted to C++, COBOL, Java, Lisp, Text

• Added GUI and result visualizer

27

Page 28: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Promoting Code Clone Technology• Collaboration with many companies

– NTT, Fujitsu, Hitachi, Samsung, Microsoft Research, ...

• International Workshop onSoftware Clones IWSC (’02~)

• Code clone seminars (9 times, ’02~’07)

28

Page 29: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

> 5,000 downloads

Page 30: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University30

Page 31: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University31

Page 32: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

32

Page 33: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

33

0

200

400

600

800

1000

1200

1400

1600

1800

~00 01~05 06~10 11~15 16~

Papers Related to Code Clone

Page 34: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

34

Applying to New Fields

Page 35: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Compiler Construction Class

S1

S1

S3

S3 S4

S4

S2

S2

S5

S5

35

Page 36: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Compiler Construction Class

S1

S1

S3

S3 S4

S4

S2

S2

S5

S5

36

Page 37: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Law Suit of Lace-making Machines

37

A B

Page 38: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

38Clones in near systems

Page 39: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

System Evolution Analysis— BSD Unix Clustering —

Similarity Measure ß39

Cover Ratio

Page 40: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Chasing Scalability• CCFinder input limitation: A few M LOC due to the

suffix tree algorithm --- on core• Code clone detection is embarrassingly parallel

problem --- divide and concur

Distributed CCFinder with 80 lab machines 40

Page 41: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

FreeBSD Ports Collection / 136 Versions of Linux Kernel

Livieri, S., Higo, Y., Matsushita, M., Inoue, K., “Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder“, ICSE2007.

136 Ver. Linux Kernels (7.4GB, 260M LOC in C)

FreeBSD Ports Collection(10.8GB, 400M LOC in C)

41

Page 42: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

42Searching for clone pairs

Page 43: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

34th International Conference on Software Engineering, pp.331-341, Zurich, Switzerland, 2012-05.

Page 44: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Developer’s Concerns

44

Existing Project

New Project

Developer

Reusable?

• Origin- Who?

- When?

- License?

- Copyright?

• Evolution- Maintenance?

- Popularity?

- Newer version?

Code Fragment

To ease concerns, a support system is needed

Page 45: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Code History Tracking System

45OSS Repositories

Code Fragmentin Question ?

License: X /Copyright: Y

Copy Source

Ancestor

Modify

Modify Copy

Time

License: X’Copyright Y’

DescendantsProj. C

Proj. A

Proj. B

Proj. D

Proj. E

Proj. F

Proj. G

Proj. H

License: XCopyright: Y

Page 46: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

System Overview

46

Input Query Q

Integrated CodeHistory Tracker

Ichi Tracker

Output Results R

Code Fragment

Internet

Code Fragment

Search Query SQ

SearchResults SR

Code Attributes(Optional)

Code Attributes attribute 1: ...

attribute 2: ......

attribute 1: ...attribute 2: ...

...

attribute 1: ...attribute 2: ...

...

attribute 1: ...attribute 2: ...

...

qc

Open Source Repositories

Code Search Engines

SPARS/R KodersGoogleCodeSearch

Code Clones

Page 47: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

: Evolution of jMonkeyEngine Project

: Cluster of Same or Similar Files

: File in New BSD License

: File in zib-libpng License

0.4

0.5

0.6

0.7

0.8

0.9

1

2006/10/10 2007/04/28 2007/11/14 2008/06/01 2008/12/18 2009/07/06 2010/01/22 2010/08/10 2011/02/26

CoverR

atio

LastModifyTime

18

1617

4 8

1

1412

10

11

19 21

3

7

2

23 2422 262015

65 9 25

13

A

B

C

1 Jmonkeyengine r3448 (K)

2 Simplexe (G)

3 Simplexe (K)

4 Jmonkeyengine r3800 (G)

5 The-project08 (G)

6 Tank CombatGame (K)

7 wrathofthetaboos (G)

8 wrathofthetaboos (K)

9 Lasthaven (G)

10 Jme-cotk (G)

11 Ardor3D (G)

12 Jmonkeyengine r4099 (G)

13 Fairytale-soulfire (G)

14 Jmonkeyengine r4490 (G)

15 Jmonkeyengine r4490 (S)

16 Xenogeddon (G)

17 Deathsquadrendezvous (G)

18 Partiendolapana (G)

19 Tholos (G)

20 Ardor3D (G)

21 Cosmic-engine (G)

22 Fregatclient3d (G)

23 Footballmanagerdesia (G)

24 Multiplicity (G)

25 Jmerefactoring (G)

26 Java3dfh (G)

Evolution Pattern of Texture.java

jMonkeyEngine R3800

R3448

R4099

Page 48: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

0.4

0.5

0.6

0.7

0.8

0.9

1

1993/01/31 1995/10/28 1998/07/24 2001/04/19 2004/01/14 2006/10/10 2009/07/06 2012/04/01

CoverR

atio

Last modifiedtime

26

2

4950

47 48

1 3 4

5 6

7

8

10

9

11

12

13

14

15

1617

18

19

24

25

27

28-3334

36

37-46

20-23

35

51

52 53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

1 Lites 1.0 (G) 28-33 Kame (G)

2KernelSourceArchive- CMUMach3.0 (K)

34-36 SimOS(K)

3 Lites 1.1.u3 (G) 27-46 Kame (G)

4 Lites 1.1-950808 (G) 47 Netnice (G)

5 TheRio(RAMI/O)Project (K) 48 Kame (G)

6 ftpinTheUniversityofEdinburgh (G) 49-50 Psumip (G)

7 Mip-summer98 (G) 51 Netnice (G)

8 freeBSD/SPARC (G) 52 Reflexprotocol (G)

9-12 ftpinStockholmUniversity (G) 53 Netnice (G)

13 freeBSD-cam2.1.5R (G) 54 NetBSD v1.105 (K)

14-15 SonicOSX(K)

55 OpenBSD PVXen(G)

16 LabyrinthBSD(labyrinthos) (K) 56 OpenBSD v1.73 (K)

17 Oskit (G) 57 Pmon (G)

18 Psumip(G)

58-62Proyecto A.T.L.D.GNU/hurd(extremelinux) (K)

19 Mach (G) 63 OpenBSD v1.74 (K)

20-22 Savannah (G) 64 Pmon (G)

23 UnofficialOSKitsource (K) 65 774 (G)

24-26 UnofficialOSKitsource(oskit) (K) 66 Chord-ns3 (G)

27 ftpinStockholmUniversity (G) 67 Openbsd-loongson-

vc (G)

ResultsbyG(GoogleCodeSearch)andK(Koders)

: File in Original BSD License

: File in New BSD License

Evolution Pattern of kern_malloc

2

Page 49: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Current and Future

Page 50: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

OSS Dependency

50

∂∂

∂∂

OSS System X

Library Q

Library P

Clone A

Clone B

System Y

System Z

Import

Export

Paste

Copy

Page 51: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

yeoman-ass...

hubot

istanbul

mochaunderscore

jshint-sty...

grunt-cont...

grunt-cont...

time-grunt

load-grunt...

grunt-cont...

inquirer

commander

grunt

chai

colors

express

tap

request

assert

extend

node-uuid

should

bluebird

async

coffee-scr...

nan

fs-extra

graceful-fs

rimraf

chalk

yosay

yeoman-gen...

jsdom

meow

ava

xo

mime

jshint

standard

through2

minimist

debug

browserify

grunt-cont...

grunt-cont...

grunt-cont...

grunt-cont...grunt-cont...

grunt-cont...

grunt-cont...

ejs

coveralls

mocha-lcov...

babel

babel-core

q

gulp-uglify

vinyl-sour...

gulp-watch

gulp

gulp-rename

clone

code

sinon

lab

winston

promise

qs prompt

nodemailer

ws

connect

README.md

globmkdirp

uglify-js

grunt-cont...

stylus

jade

nock

marked

optimist

jasmine-node

watchify

gulp-sourc...

gulp-coffee

gulp-util

del

gulp-mocha

through

redis

coffeelint

resolve

vows

sinon-chai

lodash

grunt-jscs

eslint

cz-convent...

codecov.io

phantomjs

karma-chai

karma

babel-pres...

karma-mocha

karma-phan...

karma-brow...

cheerio

jquery

watch

webpack

moment

babel-cli

gulp-autop...

gulp-minif...

require-dir

gulp-sass

gulp-concat

gulp-if

gulp-clean

gulp-replace

vinyl

grunt-cli

babel-regi...

babel-eslint

eslint-plu...inherits

minimatch

less

chai-as-pr...

benchmark

tape

event-stream

semver

concat-str...

pg

uuid

eslint-plu...

eslint-con...

xtend

readable-s...

zuul

socket.io-...

body-parser

socket.io

bower

tmp

tap-spec

yargs

gulp-plumber

babel-pres...

gulp-babel

babel-loader

babel-pres...

gulp-load-...

babel-poly...

react

run-sequence

mongodb

grunt-brow...

grunt-moch...

grunt-karma

nodemon

path

mustache

npm

supertest

power-assert

faucet

grunt-bump

object-ass...

expect.js

jscs

pre-commit

nodeunit

hoek

hapi

mongoose

blanket

grunt-shell

matchdep

grunt-rele...

joi

sqlite3

handlebars

nyc

rsvp

karma-moch...

es6-promise

karma-fire...

webpack-de...

karma-sour...

karma-webp...

grunt-eslint

gulp-jshint

gulp-bump

gulp-eslint

gulp-filter

gulp-istan...

jsdoc

esprima

js-yaml

shelljs

brfs

proxyquire

rewire

babel-runt...

aws-sdk

mockery

temp

karma-chro...

co

source-map...

cli-color

escodegen

bunyan

gulp-jscs validator

xml2js

when

http-proxy

cli-table

babelify

babel-plug...

browser-sync

eslint-plu...

mysql

isparta

null

gulp-cover...

react-router

react-dom

eslint-loa... babel-pres...

react-addo...gulp-exclu...

gulp-nsp

example

jasmine

ramda

cookie-par...

karma-jasm...

requirejs

jasmine-core

underscore...

karma-cove...

compression

highlight.js

json-loader

immutable

sass-loader

node-sass

css-loader

extract-te...

style-loaderreact-hot-...

clean-css

passport

morgan

superagent

update-not...

gulp-less

gulp-header

codeclimat...

chokidar

serve-static

merge

grunt-coff...

backbone

autoprefixer

postcss

express-se...open

angular-mo...

grunt-simp...

bootstrap

classnames

file-loader

url-loader babel-pres...

babel-plug...

expect

vinyl-buffergrunt-moch...

typescript

ncp

ember-cli

babel-plug...

node-libs-...gulp-shell

jest-cli

semantic-r...

angular

react-native

ghooks

react-redux

redux

phantomjs-...

http-server

gulp-notify

wrench

gulp-liver...

yo

ember-cli-...

ember-disa...

ember-try

ember-cli-...

ember-expo...

ember-disa...

broccoli-a...

ember-cli-...

ember-cli-...

ember-cli-...

ember-data

ember-cli-...

ember-cli-...

ember-cli-...

ember-cli-...

ember-cli-...

ember-cli-...ember-cli-...

argx

ape-report...

ape-tasking

coz

ape-updatingape-covering

ape-testing

ape-tmpl

ape-releas...

injectmock

<%= appNam...

ember-reso...

<%= name %>

everything

51

Page 52: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

52Raula Gaikovina Kula, Daniel M. German, Ali Ouni, Takashi Ishio, Katsuro Inoue, Do developers update their library dependencies? An empirical study on the impact of security advisories on library migration, EMSE accepted & FSE2017 J. 1st.

Page 53: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

53

Page 54: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

54

Page 55: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University55

OSS Repository

UserDeveloper

OSS Repository

Depend

Depend

UseContribute

Project Project

OSS Ecosystem

Page 56: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Summary

56

Page 57: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

57

Finding similar code in softw1are matters

Page 58: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

58

Code clone analysis became very popular SE technology

Page 59: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Analyzing code similarity is key to know OSS provenance and evolution

Page 60: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions
Page 61: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

Exploring OSS universewith code similarity analysis

Page 62: ICECCS2017 Exploring Similar Code · CCFinder: a multilinguistic token-based code clone detection system for large scale source code, T Kamiya, S Kusumoto, K Inoue, IEEE Transactions

62

Thank you