Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool
Simone Livieri, Yoshiki Higo, Makoto Matsushita, Katsuro Inoue
Background
• Open-Source Software (OSS) is used in many software systems
• Relations between software systems can be exposed through code clone analysis
• Large collections of OSS exist
• Huge memory requirements, long running time
• Computing power is cheap
• Large numbers of computers are often easily accessible
• Code clone analysis can be distributed
In the beginning was CCFinder
• CCFinder is a code-clone analysis tool
• Widely used and cited
• Token based
• Many languages supported (e.g. C, C++, Java)
• Good scalability (but can’t handle very large input)
DCCFinder
• D(istributed)CCFinder is a tool for distributed code clone analysis
• Master-slave distributed system
• Data sharing through a shared file system
• Uses CCFinder to perform the code clone analysis
• The prototype ran on 80 computers of the Student Laboratory of our department
Computational Model
[Figure: the target is split into categories (category 1 to category 4) and projects (project 1 to project 8); its source files are grouped into units (unit 1 to unit n). Piece i,j combines unit i and unit j and is passed to CCFinder on a slave node.]
• The target is the set of source files undergoing code clone analysis
• A category is a set of source files sharing a specific feature or use
• A project is a single software system
• A unit is a set of source files that may cross multiple projects
• Two units make a piece. A piece is the collection of files that will be analyzed on each slave node
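The pairing of units into pieces can be sketched as follows. This is a minimal illustration, not the tool's actual scheduler: it assumes that every unit is paired with every unit exactly once, itself included, which the slides do not state explicitly. The function name `enumerate_pieces` is hypothetical.

```python
from itertools import combinations_with_replacement
from typing import List, Tuple


def enumerate_pieces(unit_count: int) -> List[Tuple[int, int]]:
    """Return every piece (i, j) with 1 <= i <= j <= unit_count.

    Assumption: a unit is also paired with itself, so that clones
    inside a single unit are detected too.
    """
    unit_ids = range(1, unit_count + 1)
    return list(combinations_with_replacement(unit_ids, 2))


# Three units yield six pieces, each of which would be one CCFinder job.
print(enumerate_pieces(3))
# [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
```

Under this assumption the number of pieces grows as n(n+1)/2 for n units, so the unit size directly controls how many jobs the master has to hand out.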
System Implementation (1)
• Written in Java (about 20 kLOC)
• Master-Slave-Registry communication handled with Java RMI
• Basic fault tolerance

Master and slave node characteristics
Processor: Pentium IV 3 GHz
Memory: 1 GByte
Network link: Gigabit Ethernet connected to 100 Mbit/s network hubs
OS: FreeBSD 5.3-STABLE
Local storage: 40~50 GBytes
Analysis Process
System Implementation (2)
• Indexer
  • Examines the target and collects file size, LOC, project and category name
  • Computes unit boundaries
• Master Node
  • Creates the input files for CCFinder and assigns jobs to the slaves
• Slave Node
  • Copies the files to the local storage
  • Executes CCFinder
  • Copies the output to the shared storage
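One way the Indexer's unit-boundary step could work is sketched below. The slides do not describe the actual algorithm, so the greedy sequential packing used here is an assumption, and `unit_boundaries` is a hypothetical name. File sizes and the unit size are in bytes.

```python
from typing import List, Tuple


def unit_boundaries(file_sizes: List[int], unit_size: int) -> List[Tuple[int, int]]:
    """Return half-open index ranges [start, end) over the file list, one per unit.

    Assumption: files are packed in their listed order, and a unit is closed
    as soon as the next file would push it past unit_size.
    """
    boundaries = []
    start = 0      # index of the first file in the unit being filled
    current = 0    # total size of the unit being filled
    for index, size in enumerate(file_sizes):
        # Close the unit if this file would exceed the cap, unless the unit
        # is still empty (an oversized file then gets a unit of its own).
        if current + size > unit_size and index > start:
            boundaries.append((start, index))
            start = index
            current = 0
        current += size
    if start < len(file_sizes):
        boundaries.append((start, len(file_sizes)))
    return boundaries


# Illustrative sizes only: five files packed into units of at most 15 bytes.
print(unit_boundaries([6, 6, 6, 20, 3], unit_size=15))
# [(0, 2), (2, 3), (3, 4), (4, 5)]
```

Because files are taken in listing order, a unit built this way can span the end of one project and the start of the next, which matches the statement that units may cross multiple projects.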
System Implementation (3)
• Clone Coverage Analyzer
  • Computes the number of shared lines of code between each pair of files, projects and categories
• Image Generator
  • Generates scatter plots, heat maps or bar charts from the clone coverage data
Case Study I: The FreeBSD Target
• Vast collection of Open-Source software used by the FreeBSD OS
• Unit size: 15 MBytes
• Minimum code clone length: 50 tokens
• Total number of tasks: 269,745

Number of categories: 45
Number of projects: 6,658
Number of .c files: 754,552
Total lines of code: 403,625,067
Total size: 10.8 GBytes

Time elapsed
Indexer: 22 minutes
D-CCFinder: 51 hours
Scatter plot, Clone Coverage Analyzer: 23 hours
Scatter plot, Image Generator: 4 hours
Total: 78 hours 22 minutes
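The reported figures can be cross-checked with a little arithmetic. The unit count n below is inferred, not reported: it assumes that each task is one pair of units, with every unit also paired with itself, which the slides do not state.

```latex
\[
\frac{n(n+1)}{2} = 269{,}745 \;\Longrightarrow\; n = 734
\]
\[
734 \times 15\ \text{MBytes} = 11{,}010\ \text{MBytes} \approx 10.8\ \text{GBytes}
\quad (\text{at } 1{,}024\ \text{MBytes per GByte})
\]
\[
22\ \text{min} + 51\ \text{h} + 23\ \text{h} + 4\ \text{h} = 78\ \text{h}\ 22\ \text{min}
\]
```

Under that assumption the task count, the unit size and the total target size agree with each other, and the stage times add up to the reported total.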
Case Study I: Result
php4 and php5 duplicated source tree
gstream’s main source tree is duplicated inside all the gstream plugin projects
Multiple copies of the X Window System source tree
Database Category
CCC1: 41%
Causes:
• Different versions of the same software
• Database drivers for different languages
• Multiple copies of the phpX source tree
Development Category
CCC1: 38%
Causes:
• Mainly the presence of different versions of the GNU binary utilities and compilers
Lang and Development Categories
CCC1: 28%
Causes:
• The presence in both categories of the suite of GNU compilers
X11 Fonts Category
CCC1: 46%
Causes:
• Small category size
• Seven copies of the X Window System source tree
Case Study II: SPARS-J and the FreeBSD Target
• SPARS-J is a Java component analysis tool
• About 47,000 lines of code; written in C
• Code clones between SPARS-J and the whole FreeBSD target were detected
Case Study II: Code Clone Coverage (before)
Most of the code clones were from a single file: getopt.c
Case Study II: Code Clone Coverage (after)
• Code clones from CGI handling source code
• Specialized version of getopt.c
Summary
• Proposed a new approach to distributed large scale code clone analysis
• Obtained a global overview of code clones in the FreeBSD target
• In SPARS-J, easily identified the use of code from the FreeBSD target
Summary (2)
• The acceleration gain was 20. Limited by:
  • data transfer, network congestion, master-slave coordination
• Generating scatter plots of a reasonable size traded accuracy for speed. Effects:
  • Source code organization easily visible, enhanced artifacts, finer details not distinguishable
• Currently can’t efficiently filter unnecessary or not-so-interesting code clones
  • Being addressed by exploring fingerprint-based source code analysis
Future Work
• Currently D-CCFinder is being rewritten
  • Better fault tolerance
  • GUI
  • Distributed post-processing and image generation
• Exploring the evolution of different software systems with code clone analysis
Metrics
$$\mathrm{CCC1}(M_0, M_1) = \frac{\mathrm{LOC}\left(CC_{M_0 M_1}(M_0, M_1)\right)}{\mathrm{LOC}(M_0) + \mathrm{LOC}(M_1)} \times 100$$

$$\mathrm{CCC2}(M_0, M_1) = \frac{\mathrm{LOC}\left(CC_{M_0 M_1}(M_0)\right)}{\mathrm{LOC}(M_0)} \times 100$$

• $M_0, M_1$: a pair of files, projects or categories
• $CC_{M_0 M_1}(M_0, M_1)$: segments of the code clones between $M_0$ and $M_1$
• $CC_{M_0 M_1}(M_0)$: segments of the code clones between $M_0$ and $M_1$ in $M_0$
• $\mathrm{LOC}(x)$: number of lines of code in $x$

CCC1 is the percentage of shared lines of code between M0 and M1, computed over the total lines of code of M0 and M1.
CCC2 is the percentage of lines of code that M0 shares with M1, computed over the total lines of code of M0.
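The two metrics described above can be computed as in the sketch below. It assumes that the cloned line counts on each side are already known (for instance from the Clone Coverage Analyzer output) and that the lines shared between M0 and M1 are counted as the cloned lines in M0 plus the cloned lines in M1. The function names and the numbers in the example are illustrative, not taken from the case studies.

```python
def ccc1(loc_m0: int, loc_m1: int, cloned_in_m0: int, cloned_in_m1: int) -> float:
    """Percentage of shared lines over the total lines of M0 and M1.

    Assumption: the shared lines are the cloned lines counted on both sides.
    """
    return 100.0 * (cloned_in_m0 + cloned_in_m1) / (loc_m0 + loc_m1)


def ccc2(loc_m0: int, cloned_in_m0: int) -> float:
    """Percentage of M0's lines that M0 shares with M1."""
    return 100.0 * cloned_in_m0 / loc_m0


# Made-up example: M0 has 200 lines, M1 has 800 lines,
# and 100 lines on each side belong to clones between them.
print(ccc1(200, 800, 100, 100))  # 20.0
print(ccc2(200, 100))            # 50.0 (M0 against M1)
print(ccc2(800, 100))            # 12.5 (M1 against M0)
```

The example shows why both metrics are useful: CCC1 is symmetric and is dominated by the larger side, while CCC2 depends on which side is taken as M0 and so reveals when a small system is largely copied from a big one.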