DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Page 1: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Simone Livieri, Yoshiki Higo, Makoto Matsushita, Katsuro Inoue

Page 2: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Background

• Open-Source Software (OSS) is used in many software systems

• Relations between software systems can be exposed through code clone analysis

• Large collections of OSS exist
  • Huge memory requirements, long running time

• Computing power is cheap
  • Large numbers of computers are often easily accessible
  • Code clone analysis can be distributed

Page 3: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

In the beginning was CCFinder

• CCFinder is a code-clone analysis tool
• Widely used and cited
• Token based
• Many languages supported (e.g. C, C++, Java)
• Good scalability (but can’t handle very large input)

Page 4: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

DCCFinder

• D(istributed)CCFinder is a tool for distributed code clone analysis

• Master-slave distributed system
• Data sharing through a shared file system
• Uses CCFinder to perform the code clone analysis
• The prototype ran on 80 computers of the Student Laboratory of our department

Page 5: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Computational Model

[Figure: the target is split into categories (category 1 to category 4), the categories into projects (project 1 to project 8), and the source files into units (unit 1 ... unit n). On a slave node, CCFinder analyzes piece(i, j), made of unit i and unit j.]

• The target is the set of source files undergoing code clone analysis
• A category is a set of source files sharing a specific feature or use
• A project is a single software system
• A unit is a set of source files that may cross multiple projects
• Two units make a piece. A piece is the collection of files that will be analyzed on each slave node
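The pairing of units into pieces can be sketched as follows. This is a minimal illustration, not the tool's code: the function name `pieces` is hypothetical, and it assumes that a unit is also paired with itself (so that clones inside one unit are found) and that the order of the two units does not matter.

```python
from itertools import combinations_with_replacement

def pieces(n_units):
    """Enumerate every piece (i, j) with 1 <= i <= j <= n_units.

    Assumption: a unit is paired with itself as well as with every
    later unit; piece (j, i) is treated as the same piece as (i, j).
    """
    return combinations_with_replacement(range(1, n_units + 1), 2)

# With 4 units there are 4 * 5 / 2 = 10 pieces to hand out to slave nodes.
all_pieces = list(pieces(4))
```

Under these assumptions n units produce n(n+1)/2 pieces, so the number of jobs grows quadratically with the number of units.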

Page 6: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

System Implementation (1)

• Written in Java (about 20 kLOC)
• Master-Slave-Registry communication handled with Java RMI
• Basic fault tolerance

Master and slave node characteristics

Processor: Pentium IV 3GHz
Memory: 1 GByte
Network Link: Gigabit Ethernet connected to 100 MBit/s network hubs
OS: FreeBSD 5.3-STABLE
Local Storage: 40~50 GBytes

Page 7: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Analysis Process

Page 8: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

System Implementation (2)

• Indexer
  • Examines the target and collects file size, LoC, project and category name
  • Computes unit boundaries
• Master Node
  • Creates the input files for CCFinder and assigns jobs to the slaves
• Slave Node
  • Copies the files to the local storage
  • Executes CCFinder
  • Copies the output to the shared storage
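The slides do not say how the Indexer computes unit boundaries. One plausible reading is a greedy pass over the indexed files that closes a unit when the next file would push it past the unit size. The sketch below assumes exactly that; `compute_units` and the toy file list are hypothetical.

```python
from typing import Iterable, List, Tuple

def compute_units(files: Iterable[Tuple[str, int]], unit_size: int) -> List[List[str]]:
    """Greedily pack (path, size) entries, in order, into units.

    Assumption: a unit is closed as soon as adding the next file would
    exceed unit_size; a single oversized file still gets its own unit.
    """
    units: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Start a new unit when the next file would overflow the current one.
        if current and current_size + size > unit_size:
            units.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        units.append(current)
    return units

# Toy index: two projects ("a" and "b") with made-up file sizes.
index = [("a/x.c", 6), ("a/y.c", 5), ("a/v.c", 4), ("b/z.c", 7), ("b/w.c", 3)]
units = compute_units(index, unit_size=12)
```

In this toy run the second unit holds `a/v.c` and `b/z.c`, which shows how a unit can cross project boundaries.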

Page 10: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

System Implementation (3)

• Clone Coverage Analyzer
  • Computes the number of shared lines of code between each pair of files, projects and categories
• Image Generator
  • Generates scatter plots, heat maps or bar charts from the clone coverage data

Page 12: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: The FreeBSD Target

• Vast collection of Open-Source software used by the FreeBSD OS
• Unit size: 15 MBytes
• Minimum code clone length: 50 tokens
• Total number of tasks: 269,745

Number of categories: 45
Number of projects: 6,658
Number of .c files: 754,552
Total lines of code: 403,625,067
Total size: 10.8 GBytes

Time elapsed

Indexer: 22 minutes
D-CCFinder: 51 hours
Scatter plot
  Clone Coverage Analyzer: 23 hours
  Image Generator: 4 hours
Total: 78 hours 22 minutes
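The reported task count can be checked against the piece model. If every unit is paired with itself and with every other unit (an assumption, not stated on the slide), n units give n(n+1)/2 tasks. The hypothetical helper below inverts that formula.

```python
import math

def units_for_tasks(tasks):
    """Return n such that n * (n + 1) / 2 == tasks, or None if no integer n fits."""
    n = (math.isqrt(8 * tasks + 1) - 1) // 2
    return n if n * (n + 1) // 2 == tasks else None

# 269,745 is the task count reported for the FreeBSD target.
n_units = units_for_tasks(269_745)
```

The result is 734 units, which is in the same range as 10.8 GBytes divided into 15 MByte units, so the figures on the slide appear to be consistent with this reading.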

Page 13: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

Page 14: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

php4 and php5 duplicated source tree

Page 15: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

gstream’s main source tree is duplicated inside all the gstream plugin projects

Page 16: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

Multiple copies of the X Window System source tree

Page 17: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

Page 18: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

Database Category
CCC1: 41%
Causes:
• Different versions of the same software
• Database drivers for different languages
• Multiple copies of the phpX source tree

Page 19: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

Development Category
CCC1: 38%
Causes:
• Mainly the presence of different versions of the GNU binary utilities and compilers

Page 20: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

Lang and Development Categories
CCC1: 28%
Causes:
• The presence in both categories of the suite of GNU compilers

Page 21: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study I: Result

X11 Fonts Category
CCC1: 46%
Causes:
• Small category size
• Seven copies of the X Window System source tree

Page 22: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study II: SPARS-J and the FreeBSD Target

• SPARS-J is a Java component analysis tool
• About 47,000 lines of code; written in C
• Code clones between SPARS-J and the whole FreeBSD target were detected

Page 23: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study II: Code Clone Coverage (before)

Most of the code clones were from a single file: getopt.c

Page 24: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Case Study II: Code Clone Coverage (after)

• Code clones from CGI handling source code
• Specialized version of getopt.c

Page 25: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Summary

• Proposed a new approach to distributed large scale code clone analysis

• Obtained a global overview of code clones in the FreeBSD target

• In SPARS-J, effortlessly identified the use of code from the FreeBSD target

Page 26: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Summary (2)

• The acceleration gain was 20. Limited by:
  • data transfer, network congestion, master-slave coordination

• Generating a reasonably sized scatter plot traded speed for accuracy. Effects:
  • Source code organization easily visible, enhanced artifacts, finer details not distinguishable

• Currently can’t efficiently filter unnecessary or not-so-interesting code clones
  • Being addressed by exploring fingerprint-based source code analysis
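To put the acceleration gain in perspective, it can be compared with the 80 computers the prototype ran on. The sketch below assumes the gain of 20 was measured on that 80-node setup, which the slides do not state explicitly; `parallel_efficiency` is a hypothetical helper.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / nodes

# Assumption: the reported gain of 20 refers to the 80-node prototype.
efficiency = parallel_efficiency(speedup=20, nodes=80)
```

Under that assumption the efficiency is 0.25, which fits the limits listed above: data transfer, network congestion and master-slave coordination.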

Page 27: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Future Work

• Currently D-CCFinder is being rewritten
  • Better fault tolerance
  • GUI
  • Distributed post-processing and image generation

• Exploring the evolution of different software systems with code clone analysis

Page 28: DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool

Metrics

\[
\mathrm{CCC1}(M_0, M_1) = \frac{\mathrm{LOC}\left(CC_{M_0 M_1}(M_0, M_1)\right)}{\mathrm{LOC}(M_0) + \mathrm{LOC}(M_1)} \times 100
\]

\[
\mathrm{CCC2}(M_0, M_1) = \frac{\mathrm{LOC}\left(CC_{M_0 M_1}(M_0)\right)}{\mathrm{LOC}(M_0)} \times 100
\]

• M_0, M_1: a pair of files or projects or categories
• CC_{M_0 M_1}(M_0, M_1): segments of the code clones between M_0 and M_1
• CC_{M_0 M_1}(M_0): segments of the code clones between M_0 and M_1 in M_0
• LOC(x): number of lines of code in x

CCC1 is the percentage of shared lines of code between M0 and M1, computed over the total lines of code of M0 and M1.

CCC2 is the percentage of lines of code that M0 shares with M1, computed over the total lines of code of M0.
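A small numeric sketch of the two metrics follows. It assumes that the shared lines counted by CCC1 are the cloned lines in M0 plus the cloned lines in M1; the function names and the sample numbers are hypothetical and are not taken from the case studies.

```python
def ccc1(cloned_in_m0, cloned_in_m1, loc_m0, loc_m1):
    """Percentage of cloned lines over the combined size of M0 and M1.

    Assumption: the shared lines are the cloned lines found in M0
    plus the cloned lines found in M1.
    """
    return 100.0 * (cloned_in_m0 + cloned_in_m1) / (loc_m0 + loc_m1)

def ccc2(cloned_in_m0, loc_m0):
    """Percentage of the lines of M0 that are shared with M1."""
    return 100.0 * cloned_in_m0 / loc_m0

# Made-up example: M0 has 200 lines (50 cloned), M1 has 300 lines (70 cloned).
pair_coverage = ccc1(50, 70, 200, 300)
m0_coverage = ccc2(50, 200)
```

CCC1 is symmetric in M0 and M1, while CCC2 depends on which side is taken as M0, so a small M0 copied wholesale into a large M1 can have a high CCC2 and a low CCC1.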