CLARKSON UNIVERSITY
MANAGING THE COPY-AND-PASTE PROGRAMMING PRACTICE
A Dissertation
By
Patricia Deshane
Coulter School of Engineering
Submitted in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy, Engineering Science
April 30, 2010
Accepted by the Graduate School
______________, _____________________ Date, Dean
Copyright 2010, Patricia Deshane
The undersigned have examined the dissertation entitled “Managing the Copy-and-Paste Programming Practice” presented by Patricia Deshane, a candidate for the degree of Doctor of Philosophy (Engineering Science), and hereby certify that it is worthy of acceptance.
April 30, 2010
Date
______________________________ Dr. Daqing Hou, Advisor
Electrical and Computer Engineering
______________________________ Dr. Susan Conry, Examining Committee
Electrical and Computer Engineering
______________________________ Dr. Christopher Lynch, Examining Committee
Mathematics & Computer Science
______________________________ Dr. Robert Meyer, Examining Committee
Electrical and Computer Engineering
______________________________ Dr. Christino Tamon, Examining Committee
Mathematics & Computer Science
Abstract
CLARKSON UNIVERSITY
Managing the Copy-and-Paste Programming Practice
By: Patricia Deshane
Advisor: Daqing Hou
Programmers often copy and paste source code in order to reuse an existing
solution in the completion of a current task. Copying and pasting results in code clones
(similar code fragments) throughout a code base, which need to be properly maintained
over time. Forgetting the cloning information and correspondence relationships within a
piece of code can be problematic for the software maintainer. Furthermore, inconsistent
editing to clones can introduce undetected bugs, decreasing the quality of the software.
This dissertation presents a suite of software tools, Eclipse plug-ins named CnP,
that aid the programmer during copy, paste, and modify programming. The purpose is to
provide tool support throughout a clone’s entire lifecycle, from its creation to its removal
from the system. More than just traditional clone detection and removal, these clone
tracking tools have a particular focus on clone editing. One CnP plug-in helps with
consistent identifier renaming within clones (CReN), another one renames substrings
consistently within clones (LexId), and a third plug-in in the CnP suite visualizes user
edits within a clone for better clone comparison (CSeR). A user study was conducted on
CnP’s basic visualization, CReN, and LexId features with analysis in terms of task
completion time, solution correctness, and method of completion.
To my wonderful husband, Todd Deshane.
Acknowledgments
Personal Reflections
With the completion of this dissertation paper and defense, I feel like I have had
the full PhD experience, and I am finally personally ready to graduate! I have completed
the course work, the qualifying exam, and the research component of the degree program.
Over the years, I have given various seminar presentations in the computer science and
engineering departments at the university, I have presented at one conference per year
during the research phase of my PhD (OOPSLA 2007 in Montreal, FSE 2008 in Atlanta,
and CASCON 2009 in Toronto), and I have written many paper submissions and drafts.
A conference trip was often a reward for having a successful paper submission.
All three conference trips that I presented at were milestones during my PhD career and
each trip was truly unforgettable. I really enjoyed traveling to these cities and learning
from other researchers in the software engineering discipline. As a presenter, I personally
got a lot of valuable feedback and advice from the conference attendees that I may not
have gotten otherwise. I believe that the conferences were a vital part of my growth as a
researcher as they helped me to “get out of the lab” and experience the rest of the world.
Special Thanks
I would first like to thank my husband, Todd Deshane, for everything over the
past ten years that we have been together. I would not be where I am today without him.
When it seems like the only thing that is constant is change, it is comforting to know that
Todd is always there to help me through the tough times and to celebrate with me in the
joyous times. Todd, all of this hard work during our PhD years was worth it – soon we
will both officially be “computer doctors”. “To every thing there is a season, and a time
to every purpose under heaven” (Ecclesiastes 3:1). I look forward to beginning the next
chapter of our lives together.
I would like to thank Professor Hou, my advisor, who took me in as his first PhD
student at a time when I had nowhere else to go. I thank him for having put up with me
for the past four years. I have learned so much from him; most importantly, I believe that
“instead of just being fed, I was taught how to fish”. For the first time, I was able to get
the guidance that I needed, with the independence that I wanted.
Thanks also to the Software Engineering Research Laboratory (SERL) at
Clarkson. I am truly grateful for having the lab/office space and resources that allowed
me to work more productively. I thank the other graduate students in SERL, especially:
Cheng (Jerry) Wang, Chandan Rupakheti, Ferosh Jacob, Yuejiao (Gloria) Wang, Xiaojia
(Joanna) Yao, Dave Pletcher, and Lin Li. I really appreciate the friendship and support
from each of you as we all spent countless hours in the lab.
I would also like to thank Jeanna Matthews for supporting me during the first year
and a half of my PhD. Thanks also go to Eli M. Dow, my mentor at IBM, who helped me
with early research and has continued to be supportive. Finally, thank you, my PhD
committee (Susan Conry, Christopher Lynch, Robert Meyer, and Christino Tamon), who
have given me early feedback during my PhD proposal and remarks on this dissertation.
Personal thanks to all of my family – my parents and siblings – extended family,
and best friends both from college and from back home. Special thanks to my best buddy,
Wenjin Hu, for his never-ending kindness and friendship, while here at school. I would
also like to mention my other “best friend”, my dog, Lady. I truly do miss her.
I give special thanks to my grandfather, B. John Jablonski, who continually kept
me motivated during my college years. It is hard to believe that it has already been four
years since his death, but I know that it has been that long since I have gotten an email
letter from him. His letters and emails always meant so much to me, with his words of
encouragement when many others did not approve or understand why I was still in school
especially without proper funding. He continues to be my inspiration.
Finally, I thank God (literally). He is everything – my counselor, comforter, and
keeper. In particular, I would like to thank God for always keeping me grounded. One of
the toughest things that I experienced during my PhD is rejection (of paper submissions).
While I feel that the rejections may have at times hindered my research progress, I feel
that if all of my paper submissions were accepted, then I may have incorrectly assumed
that this process was very simple and I may have become too proud of my own successes.
During this whole ordeal, I have learned that there is always room for improvement even
if it is difficult for me to see on my own. As Randy Pausch said, “Experience is what you
get when you didn’t get what you wanted.” But, regardless of past rejections, I know that
I have always done my best and that “with God all things are possible” (Matthew 19:26).
Ultimately, my success is not measured by men’s approval, but by God’s*.
Patricia A Deshane
Clarkson University
January 2010
* “Study to show yourself approved to God, a workman that needs not to be ashamed.” (2 Timothy 2:15) Clarkson University Motto
Disclaimer: The views and opinions expressed in this dissertation are solely those of the author and do not necessarily represent the views and opinions of anyone else affiliated with Clarkson University.
Contents
LIST OF TABLES
LIST OF ILLUSTRATIONS
LIST OF PUBLICATIONS
CHAPTER 1 INTRODUCTION
1.1. COPY, PASTE, AND MODIFY PROGRAMMING
1.2. THE TRADITIONAL PERSPECTIVE: CLONES ARE BAD
1.3. A NEW PERSPECTIVE: CLONES CAN BE GOOD
1.4. RESEARCH CONTRIBUTIONS
1.5. OUTLINE OF THIS DISSERTATION
CHAPTER 2 LITERATURE REVIEW
2.1. CLONE DETECTION AND REMOVAL
2.2. CLONE LIFECYCLE MANAGEMENT
2.2.1. BRIEF CNP TOOL DESCRIPTIONS
2.2.2. DEFINITIONS OF CLONE PROPERTIES
2.2.2.1. CLONE SIMILARITY
2.2.2.2. CLONE MODEL
2.2.2.3. CLONE VISUALIZATION
2.2.2.4. CLONE PERSISTENCE
2.2.2.5. CLONE DOCUMENTATION AND CLONE ATTRIBUTES
2.2.3. CLONE LIFECYCLE SUPPORT
2.2.3.1. CLONE CREATION
2.2.3.2. CLONE CAPTURE
2.2.3.3. CLONE EDITING
2.2.3.4. CLONE EXTINCTION
2.3. PREVALENCE OF CLONES, RENAMING, AND RELATED ERRORS IN PRODUCTION CODE
CHAPTER 3 METHODOLOGY
3.1. USER STUDY ON CNP’S VISUALIZATION, CREN, AND LEXID
3.1.1. USER STUDY HYPOTHESES
3.1.2. SUBJECT CHARACTERISTICS
3.1.3. STUDY PROCEDURE
3.1.4. TASK DESCRIPTIONS
3.1.4.1. DEBUGGING AND MODIFYING WITHIN A CLONE
3.1.4.2. RENAMING WITHIN A CLONE
CHAPTER 4 RESULTS
4.1. TIME PER TASK
4.2. SOLUTION CORRECTNESS
4.3. METHOD OF COMPLETION
CHAPTER 5 DISCUSSION
5.1. CONFOUNDING FACTORS FOR CLONE VISUALIZATION
5.2. THREATS TO VALIDITY
5.3. TOOL DESIGN
CHAPTER 6 CONCLUSION
6.1. RESEARCH CONTRIBUTIONS
6.2. FUTURE WORK
6.2.1. THEORY ABOUT COPY-AND-PASTE AND ABSTRACTIONS
6.2.2. OTHER APPLICATIONS OF THIS RESEARCH
REFERENCES
APPENDIX A IRB RECRUITMENT LETTER
APPENDIX B IRB CONSENT FORM
APPENDIX C IRB QUESTIONNAIRE
List of Tables
TABLE 1: SUMMARY OF CLONE TRACKING TOOLS WITH THEIR DEFINITIONS OF CLONE PROPERTIES
TABLE 2: SUMMARY OF CLONE TRACKING TOOLS WITH THEIR CLONE LIFECYCLE SUPPORT
TABLE 3: EXAMPLES OF WHAT LEXID CONSIDERS TO BE SUBSTRINGS.
TABLE 4: THREE EXAMPLES FROM LITERATURE THAT SHOW AN INCONSISTENT RENAMING OF IDENTIFIERS IN THE PASTED CODE FRAGMENT.
TABLE 5: HIGH-LEVEL DESCRIPTION OF THE TASKS IN THE USER STUDY.
TABLE 6: THE TIME (IN MINUTES) TO COMPLETE EACH PAIR OF TASKS
TABLE 7: STATISTICAL HYPOTHESIS TESTING ON THE PAIRED TIME DATA
TABLE 8: CORRECT STATES WHEN RUNNING THE PROGRAM OR WHEN FINISHED
TABLE 9: NUMBER OF SUBJECTS WHO USED EACH LOCATION AND INSPECTION METHOD FOR DEBUGGING AND MODIFICATION TASKS.
TABLE 10: NUMBER OF TIMES EACH RENAMING METHOD WAS USED FOR RENAMING TASKS
List of Illustrations
FIGURE 1: THE IDENTIFIER INSTANCES IN THE COPIED CODE ARE MATCHED WITH THEIR CORRESPONDING IDENTIFIER INSTANCES IN THE PASTED CODE.
FIGURE 2: THE IDENTIFIER INSTANCES IN THE COPIED AND PASTED CODE ARE PARTITIONED INTO GROUPS AND MAPPED TO EACH OTHER.
FIGURE 3: THE POSITION OF THE SOURCE CODE CHARACTERS AS REPRESENTED IN AN ASTNODE.
FIGURE 4: THE THREE CASES WHEN CAPTURING A RANGE OF SOURCE CODE USING THE ECLIPSE AST API.
FIGURE 5: CNP CLONE VISUALIZATION HAS DISTINCTION BETWEEN CLONE GROUPS AND THE CLONE ORIGIN AND ITS SUBSEQUENT PASTES.
FIGURE 6: CSER SHOWS THE CHANGES THAT WOULD BE MADE TO THE EXCLUSIONINCLUSIONDIALOG CLASS (HIGHLIGHTED CODE FOR INSERTS, DELETES, UPDATES, MOVES; AND HOVER INFORMATION FOR DELETES, UPDATES) TO MAKE THE SETFILTERWIZARDPAGE CLASS IN SETFILTERWIZARDPAGE’S FILE IN THE ECLIPSE EDITOR.
FIGURE 7: THE CLONE LIFECYCLE – CLONE CREATION, CLONE CAPTURE, CLONE EDITING, AND CLONE EXTINCTION.
FIGURE 8: CONSISTENT IDENTIFIER RENAMING WITHIN A CLONE USING CREN
FIGURE 9: THE PROGRAMMER CAN CHOOSE TO RENAME AN INSTANCE SEPARATELY FROM THE OTHERS (NOTICE THAT ONE “I” IN THE PASTED LOOP ON LINE 33 IS NOT BEING RENAMED AS A “J” WITH THE OTHERS ANYMORE)
FIGURE 10: THE ABSTRACT SYNTAX TREE (AST) OF A FOR LOOP WITH THE IDENTIFIER GROUPS HIGHLIGHTED
FIGURE 11: LEXID CHANGES THE SUBSTRINGS “LEFT” TO “RIGHT” WHEN ONE IS EDITED. IN THE FUTURE, LEXID CAN BE MADE TO AUTOMATICALLY INFER THE SUBSTRING “RIGHT” IN THE PASTED CODE BASED ON “LEFT” BY MAINTAINING A DATABASE OF COMMON NAMING PAIRS.
FIGURE 12: LEXID RENAMES A SUBSTRING “B” TO “Y” CONSISTENTLY IN PASTED CODE.
FIGURE 13: A NEW FEATURE OF LEXID CAN BE SUPPORT FOR AUTO-INCREMENTING TOKENS (LEFT) AS WELL AS LEXICAL PATTERNS IN IDENTIFIERS (RIGHT)
FIGURE 14: LEXID CAN BE MADE TO INFER THAT THE CONSTRUCTOR THAT IS CALLED WITHIN A COMMON METHOD SHOULD BE THE SAME AS THE CURRENT SUBCLASS’ NAME (“XXX”).
FIGURE 15: FIND & REPLACE CAN RENAME ALL INSTANCES OF “I” (AS A WHOLE WORD) TO “J” IN THE SELECTED LINES, BUT THIS NEEDS TO BE SPECIFIED BY THE PROGRAMMER AND IS SIMPLY A TEXT-BASED SEARCH.
FIGURE 16: RENAME REFACTORING DOES NOT WORK WITH CODE THAT DOES NOT TYPE CHECK (BINDING IS REQUIRED FOR IT TO WORK)
FIGURE 17: CREN WORKS WITH CODE THAT DOES NOT TYPE CHECK (BINDING IS NOT REQUIRED FOR IT TO WORK).
FIGURE 18: RENAME REFACTORING IS NOT LIMITED TO RENAMING WITHIN A CLONE (FOR EXAMPLE, ONLY IN THE PASTED FOR LOOP).
FIGURE 19: REFACTORING (TOP) VS. CREN (BOTTOM).
FIGURE 20: CREN WORKS ACROSS MULTIPLE FILES (FILE 1 IS ON TOP, FILE 2 IS ON THE BOTTOM)
FIGURE 21: LINKED RENAMING DOES NOT WORK WITH CODE THAT DOES NOT PARSE (NOTICE THE ADDED SEMI-COLON BETWEEN THE ++ ON LINE 33)
FIGURE 22: CREN WORKS WITH CODE THAT DOES NOT PARSE (NOTICE THE ADDED SEMI-COLON BETWEEN THE ++ ON LINE 33).
FIGURE 23: LINKED RENAMING IS NOT LIMITED TO RENAMING WITHIN A CLONE (FOR EXAMPLE, ONLY IN THE PASTED FOR LOOP)
FIGURE 24: THE CMU PAINT PROGRAM USED IN THE USER STUDY WITH WIDGETS ANNOTATED BY CORRESPONDING INSTANCE VARIABLES.
FIGURE 25: TASK 1 – RSLIDER SHOULD BE BSLIDER (ON LINE 120)
FIGURE 26: TASK 2 – COLORCHANGELISTENER SHOULD BE THICKNESSCHANGELISTENER (ON LINE 142).
FIGURE 27: TITLED BORDERS ARE SHOWN AROUND THE COLOR PANEL AND THE THICKNESS PANEL
FIGURE 28: TASK 3 – ADD A TITLED BORDER TO COLORPANEL AND TO THICKNESSPANEL
FIGURE 29: THE LABELS OF THE RED, GREEN, AND BLUE SLIDERS ARE SHOWN COLORED
FIGURE 30: TASK 4 – ADD COLOR TO THE LABEL OF EACH COLOR SLIDER: RED, GREEN, AND BLUE.
FIGURE 31: TASK 5 – RENAME COLORPANEL TO THICKNESSPANEL.
FIGURE 32: TASK 6 – RENAME TOOLPANEL TO CLEARUNDOPANEL.
FIGURE 33: TASK 7 (PART 1) – RENAME RPANEL TO GPANEL AND RSLIDER TO GSLIDER IN THE GREEN SLIDER CLONE
FIGURE 34: TASK 8 – RENAME BPANEL TO TPANEL AND BSLIDER TO TSLIDER IN THE THICKNESS SLIDER CLONE
List of Publications
[1] P. Jablonski and D. Hou, “Renaming Parts of Identifiers Consistently within Code Clones”, IEEE International Conference on Program Comprehension (ICPC), 2010. (2 pages)
[2] P. Jablonski and D. Hou, “Aiding Software Maintenance with Copy-and-Paste Clone-Awareness”, IEEE International Conference on Program Comprehension (ICPC), 2010. (10 pages)
[3] F. Jacob, D. Hou, and P. Jablonski, “Actively Comparing Clones Inside The Code Editor”, International Workshop on Software Clones (IWSC), 2010. (8 pages)
[4] D. Hou, F. Jacob, and P. Jablonski, “Exploring the Design Space of Proactive Tool Support for Copy-and-Paste Programming”, IBM Conference of the Centre for Advanced Studies on Collaborative Research (CASCON), 2009. (15 pages)
[5] D. Hou, F. Jacob, and P. Jablonski, “Proactively Managing Copy-and-Paste Induced Code Clones”, IEEE International Conference on Software Maintenance (ICSM), 2009. (2 pages)
[6] D. Hou, P. Jablonski, and F. Jacob, “CnP: Towards an Environment for the Proactive Management of Copy-and-Paste Programming”, IEEE International Conference on Program Comprehension (ICPC), 2009. (5 pages)
[7] P. Jablonski, “Clone-Aware Editing with CnP”, ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), Student Research Forum, 2008. (poster)
[8] P. Jablonski, “Techniques for Detecting and Preventing Copy-and-Paste Errors during Software Development”, Clarkson University, PhD Dissertation Proposal, 2007. (21 pages)
[9] P. Jablonski and D. Hou, “CReN: A Tool for Tracking Copy-and-Paste Code Clones and Renaming Identifiers Consistently in the IDE”, Eclipse Technology Exchange Workshop at OOPSLA (ETX), 2007. (5 pages)
[10] P. Jablonski, “Managing the Copy-and-Paste Programming Practice in Modern IDEs”, ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2007. (2 pages)
Chapter 1
Introduction
1.1. Copy, Paste, and Modify Programming
Copy and paste [236, 237, 238, 239, 240] – some people love it, others hate it. Why?
Copying and pasting obviously provides some short-term benefits, such as saving
typing and avoiding the need to remember a name’s spelling. In a study on copy-and-paste usage,
approximately 74% of programmers copied very small pieces of code of less than a single
line (such as variable names, type names, or method names) [132], which indicates that
they were copying and pasting for these kinds of reasons.
The same study also concluded that the programmers on average made four non-
trivial copy-and-pastes per hour [132]. It seems natural for programmers to copy and
paste larger code fragments (such as blocks, methods, or classes) when they see a similar
existing solution to their current task rather than write the new software solution entirely
from scratch. Not only can copying and pasting make programmers more productive in
this way, but it can be especially useful when working in an unfamiliar domain, for
instance, when learning a new programming language or framework. To help get started,
programmers can copy and paste examples from the framework’s documentation [28],
from a software repository consisting of past projects [87, 92, 217], or from an online
search engine (such as Google Code Search) [28, 86] to use as a base to work from.
All programming is maintenance programming,
because you are rarely writing original code.
- Dave Thomas
Copy and paste is a design error.
- David Parnas
Copying all or parts of a program is as natural to
a programmer as breathing, and as productive.
- Richard Stallman
Reusing Source Code Examples
Example-based programming is a legitimate form of software reuse (unlike cases
of copying and pasting in order to plagiarize [176, 198, 218], which a variety of
plagiarism detection tools have been developed to help deter, including AntiPlagiarist,
CopyCatch, DOC Cop, Eve2, Glatt, GPlag, JPlag, MyDropBox, PAIRwise, SNITCH,
SPlaT, TurnItIn, and WCopyFind*). Research findings in the psychology and AI fields
verify that working with concrete examples can be advantageous [51, 209]. However,
though some software components are especially designed to be reused (such as libraries,
frameworks, APIs, and software product lines), not all examples that a programmer may
find were specifically made for reuse purposes. As such, the programmer must be careful
to extract only the functionality that is needed for reuse, while also dealing with
dependencies that this code fragment may have to other parts of the software. Tool
support has been developed to aid programmers in the whole process of pragmatic reuse
[85, 86, 88, 89, 90, 91], reengineering [64], and in the comparison of examples [42].
* http://www.anticutandpaste.com/antiplagiarist/, http://www.copycatchgold.com/, http://www.doccop.com/, http://www.canexus.com/, http://www.plagiarism.com/, http://research.microsoft.com/apps/pubs/default.aspx?id=73093, https://www.ipd.uni-karlsruhe.de/jplag/, http://www.mydropbox.com/, http://www.pairwise.cits.ucsb.edu/, http://actlab.csc.villanova.edu/simtools/, http://splat.cs.arizona.edu/, http://www.turnitin.com/, http://plagiarism.phys.virginia.edu/Wsoftware.html
The Psychology of Software Reuse
Novices generally copy and paste when they do not have a full understanding of
the programming task. Since they are new to programming or to a particular language,
they do not have the syntactic, semantic, and schematic knowledge that experts have in
order to craft a solution. Novices are not the only ones who copy and paste for reuse,
however. According to [51, 52], expert programmers have “schemas” (plans) that
represent generic solutions kept in their memories specific to a programming domain that
they can retrieve and instantiate to solve a particular programming problem. In other
words, as experts become familiar with a problem domain, they develop domain-specific
schemas, representing their knowledge of certain types of problems [52], which they can
later recall to help them design a new program. Having prior knowledge and experience,
expert programmers can use their familiarity with the situation to gain efficiency and the
ability to solve more difficult tasks than if they had to design the solution entirely from
scratch. Routine tasks can even become impossible to do if every part is treated as new
[51, 52]. Humans naturally reuse knowledge from prior experience in the present time.
The copy and paste of source code (both large and small) tends to be a natural
behavior that provides immediate benefits. The copy-and-paste operation is not bad by
itself, but the result of copying and pasting is what is considered bad, since the resulting
clones need to be consistently modified and maintained in the long-term (the “modify”
part of “copy, paste, and modify programming” [234]). Still, many people continue to
strongly dislike copy-and-paste itself and blame it for the maintenance
problem of clones (which often leads to code inconsistencies) [182]. This and some other
perceived problems of code clones are discussed in the following section.
1.2. The Traditional Perspective: Clones are Bad
Traditionally, code cloning was considered “harmful” to a system. Some problem
areas include software maintenance, evolution, quality, and code aesthetics or design.
So, copy-and-paste is not necessarily bad in the
short run, if you are copying good code. But it is
always bad in the long run.
- Ralph Johnson
Clones as a Software Maintenance Problem
Copying and pasting within the same code base results in code duplication [243]
that needs to be properly managed and maintained. The clones are exactly the same when
initially copied and pasted, but start to differ as the newly pasted code is modified to fit
its task. At the time the copy and paste occurs, the programmer sees the similarity
between the clones (otherwise he or she would not have made an exact duplicate as a
base to work from) and he or she also has an idea of the differences that need to be made
for the new code to be properly adapted. A natural dependency exists between the clones,
which are assumed to have a certain level of similarity that must remain between them.
This invisible relationship between copied and pasted code fragments consists of the
correspondences and differences between the clones that must be maintained as the
software is updated, for example, with new features and bug fixes. It is important for the
software maintainer to remember the parts of the related clones that should remain
unchanged, parts that must change in the same way, and parts between the clones that are
meant to differ [72]. Identifying the locations of all clones in a system and remembering
their invisible relationships to one another can be extremely difficult over time.
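The kinds of correspondences described above can be illustrated with a minimal Java sketch. The class, field, and method names here are hypothetical and are not taken from CnP or from the user study code; the sketch only assumes a loop that was copied from a "red" fragment and pasted for a "blue" one.

```java
// Hypothetical illustration of a copied fragment and two pasted variants.
final class CloneSketch {
    static int[] red = {1, 2, 3};
    static int[] blue = {10, 20, 30};

    // Original (copied) fragment: totals the red values.
    static int sumRed() {
        int total = 0;                       // part that should remain unchanged
        for (int i = 0; i < red.length; i++) {
            total += red[i];                 // "red" is the part meant to differ
        }
        return total;
    }

    // Pasted fragment in which only one instance of "red" was renamed.
    // The code still compiles, so the inconsistency goes unnoticed.
    static int sumBlueInconsistent() {
        int total = 0;
        for (int i = 0; i < blue.length; i++) {
            total += red[i];                 // missed rename: still reads red
        }
        return total;
    }

    // Pasted fragment in which every instance of "red" changed in the same way.
    static int sumBlueConsistent() {
        int total = 0;
        for (int i = 0; i < blue.length; i++) {
            total += blue[i];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumRed());               // prints 6
        System.out.println(sumBlueInconsistent());  // prints 6 (silently wrong)
        System.out.println(sumBlueConsistent());    // prints 60
    }
}
```

In this sketch the loop structure and the accumulator are the parts that stay the same, while both occurrences of the array name must change together; nothing in the compiler flags the variant where only one of them does.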
Clones as a Software Evolution Problem
As changes to the software (like new features or bug fixes) are required over time,
the clones in the system may also naturally change. In some cases, the programmer may
have copied and pasted in order to get a quick solution rather than taking the time to
create an abstraction such as a procedure, function, or method. If so, these clones are
likely to be replaced by an abstraction as the code matures. The issue here is that even
though the creation of the clones is avoidable to begin with and the clones will eventually
disappear anyway, there is still a time when the clones exist in which they need to be
properly maintained. Though these particular clones are only in the system temporarily
and their entire life may be short, there is still significant effort needed in refactoring the
code. On the other hand, perpetual clones are problematic in that they require continuous,
long-term maintenance.
Clones as a Software Quality Problem
The increase in source code maintenance is not the only concern of opponents to
code cloning. The potential increase in the number of software bugs in the system is one
of the most widely cited reasons for avoiding clone creation. Some scenarios where bugs
are introduced into the system as a result of cloning include:
• The addition of a new feature: When the system needs to be updated to include a
new feature, the software maintainer must know whether to apply this particular
change to all related clones or only to some of them. If the maintainer fails to
apply this change to all of the correct clones, a bug (inconsistency) is made.
• A bug is propagated and fixed: It is possible that the original code that was copied
had an existing bug in it that has now been multiplied as it was pasted throughout
the system. Once this bug has been noticed, it then needs to be fixed in all clones
that it is in. If one of those bugs is not fixed, there remains an inconsistency,
which is actually a new bug introduced into the system!
• A clone is modified to fit its task: Changes are made to a single clone when it is
being modified to fit its own individual task. The newly pasted code fragment
typically has identifiers changed to a new name related to the current task. If not all
identifier instances are renamed consistently within the code fragment, this
will create an inconsistency (bug).
In all of these cases, the clone-related bugs can remain undetected. It may take a long
time for the absence of a new feature in a clone to be detected (especially if that part of
the software is not used often in practice). In the second case, since the existing bug was
not detected earlier, it is possible that the same bug might remain hidden somewhere else
in the code. Lastly, though a renaming inconsistency could be caught by the compiler,
there are cases when the unchanged identifier instance is still in scope (Section 2.3 –
Errors), which can remain undetected by both the compiler and programmer. All of these
clone-related bugs occur when the implicit rules in the cloning relationship are broken.
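The third scenario can be illustrated with a small, hypothetical Java example. The class, method, and variable names below are invented for illustration and are not taken from CnP or its user study. The pasted loop is meant to sum the second array, but one instance of the copied identifier was left unrenamed; because that identifier is still in scope, the code compiles and quietly computes the wrong value.

```java
// Hypothetical illustration of an inconsistent-renaming bug; all names are invented.
class RenamingBugDemo {

    // Original (copied) fragment: sums the prices array.
    static int sumPrices(int[] prices, int[] taxes) {
        int total = 0;
        for (int i = 0; i < prices.length; i++) {
            total += prices[i];
        }
        return total;
    }

    // Pasted fragment, adapted to sum taxes. The loop bound was renamed,
    // but the array access was not. "prices" is still in scope, so the
    // compiler accepts the code.
    static int sumTaxesBuggy(int[] prices, int[] taxes) {
        int total = 0;
        for (int i = 0; i < taxes.length; i++) {
            total += prices[i]; // should be taxes[i]
        }
        return total;
    }

    // Pasted fragment with every identifier instance renamed consistently.
    static int sumTaxesFixed(int[] prices, int[] taxes) {
        int total = 0;
        for (int i = 0; i < taxes.length; i++) {
            total += taxes[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] prices = {10, 20, 30};
        int[] taxes = {1, 2, 3};
        System.out.println("buggy: " + sumTaxesBuggy(prices, taxes));
        System.out.println("fixed: " + sumTaxesFixed(prices, taxes));
    }
}
```

Neither the compiler nor a casual reading flags the buggy version, which is the kind of case that consistent renaming support is meant to prevent.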
Clones as an Aesthetic or Design Problem
In addition to the potential decrease in software quality, some people say that
clones in software just look bad and that their presence in the code might indicate an
underlying design problem. Clones can artificially increase the number of lines of code
by adding “unnecessary” lines that otherwise would be in the body of a single abstraction
[226]. Charles Simonyi, who introduced the concept of “intentional programming” [29,
222], is a proponent of programming with abstractions rather than with clones. He states
that “...it is still pretty easy to decide at a glance that the code is bad – by the identifiers,
by the juxtapositions, by the size of the expressions, or by evidences of code copying”
[221]. But he also says that a program can still be beautiful even if it is not strictly
structured, as long as the program has other redeeming features [159].
Number 1 in the stink parade is duplicated code.
If you see the same code structure in more than
one place, you can be sure that your program will
be better if you find a way to unify them.
- Kent Beck and Martin Fowler
Code clones are often labeled as a “code smell” [235], which is a hint that
something could be wrong with the code. This part of the code should be inspected
further to determine whether there is actually a problem that needs to be fixed or whether the
smell can just be tolerated [179]. The term “clone smell” [13] was later coined to describe
an individual clone that appears to be problematic over time and should be inspected.
The existence of clones may indicate a design problem, since it could be that the
programmer did not fully think through the design of the software solution if abstractions
were not used wherever possible. Abstraction-supported programming languages are
designed so that programmers can take advantage of these powerful tools [29]. So, when
programmers do not use the abstractions (for whatever reason) [150], they are not getting
all of the benefits that the programming language has to offer and they may not be
properly utilizing the language as it was intended to be used by design. If the clones are
to be refactored out of the code later on anyway, it might be worth spending the effort
and time to design the abstractions correctly from the beginning. Martin Fowler sees a
connection between a code’s look and smell: “I wrote that about aesthetics in discussing
when you apply refactorings. To some extent, the situations I describe in the refactoring
guidelines are fairly vague notions of aesthetics. But I try to provide more guidance than
just saying, ‘Refactor when the code looks ugly.’ I say, for instance, that duplicated code
is a bad smell. I say that long methods are a bad smell. Big classes are a bad smell.”
1.3. A New Perspective: Clones can be Good
Duplicated or cloned code is often considered harmful to software quality;
however, it can also be a reasonable or beneficial design option. Cloning can be done with
“good intentions”, including when 1) it keeps the code clean and understandable rather
than introducing an unreadable, complicated abstraction, and 2) the programming
language lacks expressiveness, so a trusted solution is reused (for example, in COBOL)
[122, 125]. If a procedure would have too many parameters or if a programming language
does not support abstractions, then clones can be a viable alternative.
There are times when it is advised to keep clones in the source code. An empirical
study of code clone genealogies, which looked at clones over multiple versions of a program
[137], found that it may not be worth refactoring short-lived clones if they are likely to
diverge soon and that the long-living clones are often in the system due to shortcomings
of the programming language. As a result, limitations of the programming language
design may result in unavoidable duplicates in the code [132]. Research from Cordy claims
that making changes to clones (which includes refactoring them) can be considered risky
from a corporate standpoint, so to be safe, the clones should remain in the system [39].
If you have a procedure with ten parameters, you
probably missed some. - Anonymous
Have People Been Led Astray?
According to MythSE 2007, the statement that “clones are evil” is actually a myth
in software engineering [81]. Various facts are used to refute the myth, including [8, 39,
122, 125, 137, 151, 180, 191] with reasons explained on the website [81]. Godfrey says
that people may have been led astray like sheep, in their thinking as a group that cloning
is bad. He reiterates that cloning (or starting with the familiar) is both natural and good.
For example, he claims that in both arts and life, people explore new things by carefully
venturing away from the familiar and that humans find comfort in ritual, and more
importantly, repetition of trusted design elements is a part of engineering [74].
Regardless of the outcome of the debate about the value of copy-and-paste and
cloning, this PhD research focused on the fact that code clones do exist and thus need to
be managed. Even if clones are made with good intentions or out of necessity, they can
still be problematic if not handled properly. One contribution of this work, the software
tool CnP, is a proactive clone management environment that tracks copy-and-paste-
induced clones upon creation. Based on the tracked cloning information, CnP provides
support for clone-related maintenance activities. This dissertation shows how CnP’s
support for copy-and-paste clone-awareness may be able to help programmers benefit
from this clone information during debugging and modification tasks, develop software
more efficiently, and prevent inconsistent identifier renaming within clones. A user study
was performed to measure the effects of this kind of clone-aware programming.
All we like sheep have gone astray. - Isaiah 53:6
1.4. Research Contributions
The main contributions of this research included:
• The copy-and-paste (CnP) tool
o Proactive tracking – CnP/CReN were the first known clone tracking
tools published (in 2007), which took a more proactive approach to
capturing clones upon creation (by detecting when a copy and paste occurs
and gathering the initial clone and identifier information at that time when
the clones are identical).
o Intra-clone editing – CReN was the only known tool to support editing
within a clone (all previous tools only supported between-clone editing).
Intra-clone editing is done when programmers copy, paste, and modify the
pasted code to fit the current task. The kind of modification that is made in
these cases is often identifier renaming, which is what CReN supports.
o AST-based – CnP makes use of the abstract syntax tree (AST)
representation of the source code, which is a better approach than the text-
based methods that cannot differentiate between source code and any other
text. CSeR is one of the few differencing tools to take advantage of ASTs.
• Dimensions of clone tracking tool development – When comparing CnP with
related clone tracking tools, a variety of clone properties were determined that
these kinds of tools must explicitly define. Listing the properties can be useful in
the creation of new tools or to help redefine a tool’s current property definitions.
• Definition of the clone lifecycle – The comparison of tools also led to a definition
of the clone lifecycle stages, including some areas where there is current tool
support and areas that need more support.
• Realization about clone visualization – After completing a user study on CnP,
CnP’s clone visualization was not found to produce solutions that were statistically
faster or more correct than those produced without it. Observation and other analysis (in Section 5.1)
helped better determine whether and when a programmer may exploit clone
information. There is no other known similar analysis of the role of clone
information in maintenance tasks, and, thus the analysis in and of itself can be a
contribution. The analysis can be used in the design of future experiments.
1.5. Outline of This Dissertation
This dissertation first presents the traditional perspective on copying and pasting
and code cloning (Section 1.2), including the clone detection and removal approach
(Section 2.1). It then introduces the new perspective that states that even though cloning
can be problematic, clones can be reasonable and beneficial to a software system (Section
1.3). Furthermore, since these clones can be in the source code for any length of time, this
dissertation proposes that clones should be managed throughout their lifecycles until
extinction, that is, if they ever get to that stage (Section 2.2).
As most of the problems with cloning revolve around the issue of software
maintenance, support for modification or editing is the main focus of the related clone
tracking tools (Section 2.2.3.3). An additional distinction between these clone tracking
tools is whether they are proactive or retroactive, that is, whether they start capturing
clone information upon the clone’s creation (via copy and paste) or whether they use
clone detection or clone selection by the programmer, which can start the clone tracking
much later in the clone’s life (Section 2.2.3.2). Each tool can also define the properties of
clones differently, with some tool designs and implementations preferred over others
(Section 2.2.2).
Finally, this dissertation presents the design (Chapter 3) and results (Chapter 4) of
a user study that tested the CnP tool’s basic visualization and renaming features, followed
by a discussion related to this study (Chapter 5). The dissertation closes with a conclusion
and future work (Chapter 6).
Chapter 2
Literature Review
2.1. Clone Detection and Removal
Clone Detection
There is a wide variety of clone-related research [148, 149]. Traditionally, much
of the focus has been on clone detection [162, 211, 213, 214] and removal. In this field,
researchers often contribute a variety of clone detection techniques, including algorithms
[57, 60, 61, 69, 109, 110, 112, 113, 120, 193, 207, 215, 216], heuristics [17, 18] and
processes [158]. Many early algorithms made use of program dependence graphs (PDGs)
[20, 63, 93, 94, 144, 152] and program slicing [24, 145]. Beginning research dealt with
finding exact code duplicates, while later work expanded to detect “near-miss clones”
(code fragments that are not identical, but have some level of similarity) [10, 11, 21, 40,
212, 223]. Some algorithms were implemented as clone detection tools [22, 23] (such as
AntiCutAndPaste, CCFinderX, Clone Digger, CloneDR, Dup, Duplo, DupMan, Moss,
SDD, Simian, and SimScan*) whose purpose is to find code clones in pre-existing code.
* http://www.anticutandpaste.com/anticutandpaste/, http://www.ccfinder.net/ccfinderx.html, http://clonedigger.sourceforge.net/, http://www.semdesigns.com/Products/Clone/, http://cm.bell-labs.com/who/bsb/research.html, http://sourceforge.net/projects/duplo/, http://sourceforge.net/projects/dupman/, http://theory.stanford.edu/~aiken/moss/, http://wiki.eclipse.org/index.php/Duplicated_code_detection_tool_(SDD), http://www.redhillconsulting.com.au/products/simian/, http://www.blue-edge.bg/download.html
If something is worth doing once,
it's worth building a tool to do it.
- A Software Engineering Proverb
Software entities are more complex for their size
than perhaps any other human construct because
no two parts are alike (at least above the
statement level). If they are, we make the two
similar parts into a subroutine – open or closed.
In this respect, software systems differ profoundly
from computers, buildings, or automobiles, where
repeated elements abound.
- Frederick P. Brooks, Jr.
Clone detection tools are retroactive and as a result, can reveal a number of false
positives and false negatives that must be sorted through by the programmer. The fact
that humans need to go through a clone detection tool’s results to verify its accuracy in
returning actual clones of interest is a major disadvantage of these kinds of tools.
Clone Removal
People who dislike copy-and-paste and code clones tend to want to solve the
problems of cloning by removing the clones from the system as soon as possible. The
main reason for clone detection has been for subsequent clone removal, that is, to get rid
of the clones in legacy systems (already existing source code). As previously mentioned,
this approach is retroactive and thus is not solving the problem as it happens. On the
other hand, one way of proactive “clone prevention” [21] that is suggested is to simply
run a clone detection tool on the code as it is being developed, so that the clones can be
removed instantaneously by the programmer. Others even suggest preventing the creation
of clones by disabling the copy and paste functionality in the programming editor! But,
prevention is not enough, since some clones must or should remain in the source code.
The most common method of clone removal is refactoring [67], which means to
restructure or change the source code without changing its external functional behavior.
One of the most common forms of refactored clones is as a functional abstraction – to
replace the multiple, similar code fragments with a single procedure [142, 143] to make
maintenance easier since updates could be made in one spot. The common portion
between the clones would be the function body and the differences would be handled by
the function parameters. Cloned classes can be refactored such that “a base class
encapsulates the commonalities and the derived classes specialize in the peculiarities”
[74]. Using generics [108] and templates for classes [19] can also add an acceptable form
of abstraction into the system thus eliminating class-level clones. Other forms of
refactored clones [74, 148] include: macros [3], design patterns [148], program slices
[71], and software product lines [68, 184, 185]. The process of code refactoring can be
error-prone when done manually [79], but there is some default refactoring support in the
IDE (like renaming and moving [252]) and separate refactoring tools (such as [78, 79, 82,
84, 227]), which can help the programmer determine how and where to refactor.
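As a minimal sketch of the functional abstraction described above, the following hypothetical Java example shows two clones and the single method that could replace them. All names are invented for illustration. The common portion of the clones becomes the method body, and the parts that differ (the data and the threshold) become parameters.

```java
// Hypothetical example of refactoring two clones into one method; names are invented.
class ExtractMethodDemo {

    // Before: two clones that differ only in their input and threshold.
    static int countLargeOrders(int[] orders) {
        int count = 0;
        for (int value : orders) {
            if (value > 100) {
                count++;
            }
        }
        return count;
    }

    static int countHotDays(int[] temperatures) {
        int count = 0;
        for (int value : temperatures) {
            if (value > 30) {
                count++;
            }
        }
        return count;
    }

    // After: the shared loop is the method body and the differences
    // between the clones are handled by the parameters.
    static int countAbove(int[] values, int threshold) {
        int count = 0;
        for (int value : values) {
            if (value > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] orders = {50, 150, 200};
        int[] temperatures = {25, 31, 35, 18};
        System.out.println(countAbove(orders, 100) == countLargeOrders(orders));
        System.out.println(countAbove(temperatures, 30) == countHotDays(temperatures));
    }
}
```

After such a refactoring, a future change to the counting logic needs to be made in only one place.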
When to Refactor
There are varying perspectives about when to refactor. Purists believe that all
code smells (including code clones) should be avoided with no exceptions [235]. They
agree with the “Don’t Repeat Yourself (DRY)” principle, which states that “every piece
of knowledge must have a single, unambiguous, authoritative representation within a
system” [242]. The Extreme Programming (XP) software development methodology calls
this “Once and Only Once” (that is, that “each and every declaration of behavior should
appear once and only once”) [244]. Followers of these rules would favor refactoring to
make a single abstraction as soon as possible. The “rule of thumb” of when to refactor,
however, states that copying and pasting of the same code is allowed up to three times
until the clones should be refactored [246, 247], called the “Rule of Three”. In general, it
takes at least three applications of something for it to be considered a pattern [247], so it
seems that the “Rule of Three” would be what is more often done naturally in practice.
The first time you do something, you just do it.
The second time you do something similar, you
wince at the duplication, but you do the duplicate
thing anyway. The third time you do something
similar, you refactor. - Don Roberts
Despite the potential benefits of refactoring to make the code more maintainable
and less complex, refactoring can be done prematurely before it would happen naturally.
This could be problematic and require significant effort to fix. Also, creating an
abstraction can be difficult or impossible, for example, due to the programmer’s inability
to create the abstraction [76, 150] or due to language constraints. Furthermore, even
though there are rules about when to refactor, the rules can be broken, which would leave
clones in the system that need to be managed for a temporary or extended period of time.
2.2. Clone Lifecycle Management
Since clones will continue to exist and some clones may even be intentionally
permanent, tool support is needed for all stages of the clone lifecycle. The term “clone
management” has been used to refer to “clone removal” [146, 147] and also one kind of
“clone editing” that links together clones for common changes to be made simultaneously
among them [54, 55, 189, 231]. Both “clone editing” and “clone removal” (in other
words, clone extinction) are parts of the clone lifecycle that can be managed with the aid
of software tools. This dissertation presents the dimensions of a software tool, CnP,
which provides copy-and-paste-induced clone management in the Eclipse IDE.
Cloning is a good strategy if you have the right
tools in place. Let programmers copy and adjust,
and then let tools factor out the differences with
appropriate mechanisms. - Ira Baxter
2.2.1. Brief CnP Tool Descriptions
The entire suite of Eclipse plug-ins from this research that supports copy, paste,
and modify programming is called CnP. At the time of this writing, the CnP project
consists of three plug-ins: CReN (for consistent identifier renaming), LexId (for
consistent substring renaming), and CSeR (for clone comparison). All CnP plug-ins
utilize the abstract syntax tree (AST) source code representation that is available in the
Eclipse framework. First, the tools track the cloning relationship right when the code is
copied and pasted before any changes are made. Each clone’s location is accurately
tracked according to its starting character position and length in number of characters
within a source code file. Only copied and pasted code that is fully contained within an
AST node is captured in this model. Related clones from the same copy and paste
sequence are also noted (Section 2.2.2.2 – Clone Model).
CnP’s basic visualization (used in CReN and LexId) consists of colored bars next
to the clone’s code fragment within the source code file. CSeR has its own unique
method of visualization that differentiates between inserts, deletes, updates, and moves,
highlighting each kind of user-made change with a different color (Section 2.2.2.3 –
Clone Visualization).
In addition to clone tracking and visualization, CReN and LexId track identifiers
within these related clones. First, the identifier instance locations between the clones
(which are AST leaf nodes of type SimpleName) are matched, which represents the
correspondence relationship, as in Figure 1. (Note: this correspondence is not used by
CReN or LexId yet). Then, all of the same identifier instances are grouped together,
which are assumed to be renamed together consistently, as in Figure 2. This way when
the programmer edits any one of the identifier instances, all others of the same program
element or name are renamed with it automatically and consistently. All identifier
instances that are currently being edited within a clone are shown boxed, similar to
Eclipse’s Linked Renaming (Section 2.2.3.3 – Clone Editing).
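The grouping step described above can be approximated with a short sketch. This is not CReN's implementation: CReN works on AST leaf nodes of type SimpleName, whereas the code below assumes a crude regular expression as a stand-in tokenizer (which also matches keywords) and operates on a plain string. The class and method names are hypothetical. The sketch groups identifier instances by name and then renames every instance in one group together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of grouped identifier renaming within one pasted fragment.
// Assumption: a regular expression replaces the AST-based identifier lookup.
class IdentifierGroupsDemo {
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Maps each identifier name to the start offsets of all of its instances.
    static Map<String, List<Integer>> groupByName(String fragment) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        Matcher matcher = IDENTIFIER.matcher(fragment);
        while (matcher.find()) {
            groups.computeIfAbsent(matcher.group(), k -> new ArrayList<>())
                  .add(matcher.start());
        }
        return groups;
    }

    // Renames every instance in one group. Instances are replaced from last
    // to first so that the offsets of earlier instances remain valid.
    static String renameGroup(String fragment, String oldName, String newName) {
        List<Integer> offsets = groupByName(fragment).get(oldName);
        if (offsets == null) {
            return fragment;
        }
        StringBuilder result = new StringBuilder(fragment);
        for (int i = offsets.size() - 1; i >= 0; i--) {
            int start = offsets.get(i);
            result.replace(start, start + oldName.length(), newName);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String pasted = "for (int i = 0; i < rows; i++) { sum += rows * i; }";
        System.out.println(renameGroup(pasted, "rows", "cols"));
    }
}
```

Because the groups are computed over the pasted fragment only, a rename does not touch identifiers of the same name elsewhere in the file, which mirrors the within-clone scope of the editing described here.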
Figure 1: The identifier instances in the copied code are matched with their corresponding identifier instances in the pasted code.
Figure 2: The identifier instances in the copied and pasted code are partitioned into groups and mapped to each other.
LexId further adds onto this default functionality of CReN by tracking and
grouping together common substrings between the different identifiers within a clone.
LexId tracks corresponding identifier pieces and renames these identical parts of
identifier names consistently together within copied and pasted code fragments. All
instances of a common substring between all identifiers within a clone are renamed
together as one of those is renamed by the programmer (Section 2.2.3.3 – Clone Editing).
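The idea of renaming a common substring across identifiers can be sketched as follows. This is only an illustration under simplifying assumptions: how LexId actually splits identifiers into parts is not specified here, so the sketch uses a case-sensitive substring match and a regular expression in place of the AST. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of consistent substring renaming within a pasted fragment.
// Assumptions: identifiers are found with a regular expression, and a "part"
// is any case-sensitive substring of an identifier.
class SubstringRenameDemo {
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Replaces oldPart with newPart inside every identifier that contains it.
    // Text outside identifiers (operators, literals, white space) is untouched.
    static String renameSubstring(String fragment, String oldPart, String newPart) {
        Matcher matcher = IDENTIFIER.matcher(fragment);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String renamed = matcher.group().replace(oldPart, newPart);
            matcher.appendReplacement(result, Matcher.quoteReplacement(renamed));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String pasted = "int minX = box.minX(); int minY = box.minY();";
        System.out.println(renameSubstring(pasted, "min", "max"));
    }
}
```

In this example, editing "min" to "max" in one identifier updates minX and minY together, even though they are different identifiers.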
2.2.2. Definitions of Clone Properties
Certain properties of clones need to be explicitly defined when creating a software
tool that tracks code clones. CnP and related software tools can define each clone
property in different ways. The following subsections give a variety of definitions that are
used for clone similarity, clone model, clone visualization, clone persistence, and clone
documentation and clone attributes. Table 1 (on the next page) summarizes the design
and implementation details for each of the related clone tracking tools: Clonescape [38],
CPC [251], Codelink [231], LAPIS [189], and CloneTracker [54, 55], including CnP [95,
96, 97, 100, 101, 102] (and its parts: CReN consistent identifier renaming [103], LexId
consistent substring renaming [104], and CSeR clone comparison [106, 107])*, and it
specifically highlights the problems that the related tools did not address that CnP does.
The emphasis of these six tools, in particular, is in supporting the editing phase of the
lifecycle to avoid inconsistent modifications to clones.
2.2.2.1. Clone Similarity
As mentioned in Chapter 1, programmers often copy and paste (which creates
code clones) when they see a similarity between existing code and the current task at
hand. Research in the psychology field agrees that people’s minds work in this way –
new problems are often solved by using prior problems’ solutions [51, 52, 65, 73, 160,
170, 253]. People, even as children, recognize analogy and similarity when comparing
things and they know the correspondence relationship between the objects, whether the
object attributes are shared (similarity) or not (analogy) [73].
* http://s88387243.onlinehome.us/wiki/Clonescape/, http://cpc.anetwork.de/, http://harmonia.cs.berkeley.edu/harmonia/projects/codelink/, http://www.cs.cmu.edu/~rcm/lapis/, http://www.cs.mcgill.ca/~swevo/clonetracker/, http://www.clarkson.edu/~dhou/projects/CnP/
Software clones are segments of code that are
similar according to some definition of similarity.
- Ira Baxter
Table 1: Summary of Clone Tracking Tools with their Definitions of Clone Properties
In general, code clones are defined as “similar” code fragments in software, from
a few lines of code to whole files. The similarity relationship between clones is often
defined in terms of the characteristics of the code that make up the clones such as its text,
syntax, semantics, or pattern [148]. Four types of clones have been defined [23]:
• A Type 1 clone is an exact copy without modifications (except for white space
and comments).
• A Type 2 clone is a syntactically identical copy in which only variable, type, or
function identifiers were changed.
• A Type 3 clone is a copy with further modifications such that statements were
changed, added, or removed.
• And a Type 4 clone is a semantically (or functionally) equivalent segment, which
may differ significantly in terms of textual equivalence.
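The four types can be made concrete with a small, hypothetical Java example in which each method body is a clone of the body of sum. The method names differ only so that the class compiles; the examples are invented for illustration and are not drawn from [23].

```java
import java.util.Arrays;

// Hypothetical illustration of the four clone types; all names are invented.
class CloneTypesDemo {

    // Original fragment.
    static int sum(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Type 1: identical except for white space and comments.
    static int sumType1(int[] values) {
        int total = 0; // running total
        for (int i = 0; i < values.length; i++) { total += values[i]; }
        return total;
    }

    // Type 2: syntactically identical, only identifiers were changed.
    static int sumType2(int[] prices) {
        int cost = 0;
        for (int k = 0; k < prices.length; k++) {
            cost += prices[k];
        }
        return cost;
    }

    // Type 3: further modified, here by an added statement that skips negatives.
    static int sumType3(int[] prices) {
        int cost = 0;
        for (int k = 0; k < prices.length; k++) {
            if (prices[k] < 0) {
                continue;
            }
            cost += prices[k];
        }
        return cost;
    }

    // Type 4: functionally equivalent to the original but textually different.
    static int sumType4(int[] values) {
        return Arrays.stream(values).sum();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(sum(data) + " " + sumType1(data) + " "
                + sumType2(data) + " " + sumType3(data) + " " + sumType4(data));
    }
}
```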
Clones that are a result of copying and pasting usually remain textually similar (Types 1-
3) [23] and are the kind of clones that most clone detection research has focused on.
Semantic clones (Type 4), however, can be very difficult [69] or nearly impossible to find
retroactively [23]. All clone detection tools rely on some notion of similarity in source
code in order to define clones and they return “sets of code blocks within a user-supplied
similarity threshold of each other” [223]. But, clone detection tool results are not perfect,
even for identical code, since other things like clone boundaries need to be considered.
Like with clone detection tools, determining the similarities and differences
between code fragments is also useful in managing clones. The next two subsections
explain some ways that clone tracking tools use similarity to define what a clone is and
how to manage these clones, respectively.
Defining Clones
For the retroactive tools that rely on clone detection (CloneTracker), there is a
level of similarity that must exist for existing code pieces to be considered clones that is
defined by the clone detection tool. For the retroactive tools that rely on the
programmer’s selection (Codelink and LAPIS), the initial level of similarity is defined by
the programmer who is selecting the clones. Either selecting clones or using the clone
detection tool, if done after the cloning relationships have been forgotten by the
programmer, can yield inaccurate clones. For proactive tools that capture copy-and-paste-
induced clones (CnP, Clonescape, and CPC), the new code fragment is guaranteed to be a
clone and is identical to the original when initially pasted. Because of this, proactive tools
only need to consider what happens to the similarity between clones as they evolve.
Managing Clones
CnP’s approach to the definition of clone similarity can be characterized as being
constructive and extensional. For example, the consistent renaming (CReN) portion of
CnP manages similarity such that clones in the same clone group all have corresponding
identifiers, which must be renamed together in each clone. The corresponding identifier
groups need to be constructed ahead of time and tracked thereafter. This correspondence
between identifiers can thus be considered as part of the similarity between clones within
the same clone group. In addition to identifier extraction, LexId goes further by grouping
and tracking parts of identifiers (substrings) together. The CSeR correspondence map
currently tracks fields, methods, parameters, conditional expressions, method calls,
simple names, and literal constants between the clone and its origin. It also uses the
Levenshtein Distance (LD) to connect similar but not identical changes as an “update”.
Codelink uses the longest-common subsequence (LCS) algorithm (like the one
implemented by the UNIX Diff utility) to determine the commonalities and differences of
clones within a clone group. The main shortcomings of the LCS algorithm include its
potentially long running time and lack of intuitive results [231].
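For reference, a textbook dynamic-programming version of the longest common subsequence is sketched below on token arrays. It is not Codelink's implementation, whose details are not given here; the class name and the sample tokens are hypothetical. The table has one cell per pair of tokens, so time and memory grow with the product of the two fragment lengths, which is one way to see where the running-time concern comes from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Textbook longest-common-subsequence sketch over tokens (not Codelink's code).
class LcsDemo {

    // Returns one longest common subsequence of the two token arrays.
    static List<String> lcs(String[] a, String[] b) {
        // len[i][j] = LCS length of the first i tokens of a and first j tokens of b.
        int[][] len = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i - 1][j], len[i][j - 1]);
                }
            }
        }
        // Walk back through the table to recover the common tokens.
        List<String> common = new ArrayList<>();
        int i = a.length;
        int j = b.length;
        while (i > 0 && j > 0) {
            if (a[i - 1].equals(b[j - 1])) {
                common.add(a[i - 1]);
                i--;
                j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        Collections.reverse(common);
        return common;
    }

    public static void main(String[] args) {
        String[] original = "total += prices [ i ] ;".split(" ");
        String[] pasted = "total += taxes [ i ] ;".split(" ");
        System.out.println(lcs(original, pasted));
    }
}
```

The tokens that fall outside the common subsequence (here, prices and taxes) are the differences between the two clones.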
The most popular method of code similarity in related work seems to be the
Levenshtein Distance (LD) (in Clonescape, CPC, CloneTracker, and CSeR), which is a
metric of the amount of editing (the edit distance) needed to make two strings the same.
CloneTracker does its line mapping technique by calculating the LD for two lines of code
at a time. Unlike the constructive, extensional nature of CReN and LexId’s approach, the
code can be tokenized whenever LD needs to be calculated. Thus, LD is not calculated
ahead of time and there is no need to track the result of LD. Also, since the Levenshtein
Distance only returns a numerical value representing clone similarity, it will not tell
additional information about similarity, like which parts of each clone are different.
CReN and LexId’s notion of similarity, on the other hand, is purely syntax-based and
requires parsing to reveal the exact commonalities and differences among clones.
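The following sketch shows the standard dynamic-programming computation of the Levenshtein Distance. It works on characters for simplicity, whereas the tools discussed above may compare tokens or lines; the class name is hypothetical. As noted above, the result is a single number, so it says how far apart two strings are but not where they differ.

```java
// Standard Levenshtein Distance sketch on characters (class name is hypothetical).
class LevenshteinDemo {

    // Returns the minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(distance("total += prices[i];", "total += taxes[i];"));
    }
}
```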
2.2.2.2. Clone Model
The following subsections describe the clone model for each tool, both in terms of
how clone locations and clone relationships are represented.
Clone Location
CnP and other clone-related tools that use a tree-based representation of the
source code specifically use the abstract syntax tree (AST) API provided in the Eclipse
JDT framework [157]. In Eclipse, an AST node (ASTNode) contains a part of the
program’s source code. The source code characters and their absolute position in the
source code file are captured in the AST. Each ASTNode has a starting position that
denotes the numeric position of the first character in the node’s content and an ending
position that denotes the numeric position of the last character in the node’s content. An
ASTNode node’s character starting position can be represented as StartPos, whose value
can be retrieved with the Java code: node.getStartPosition() and its character ending
position can be represented as EndPos, whose value can be calculated with the Java code:
node.getStartPosition() + node.getLength() - 1, as shown in Figure 3.
Figure 3: The position of the source code characters as represented in an ASTNode.
CnP maps the actual source code that is copied and pasted to the largest
continuous set of whole AST nodes within the range. The beginning of the code fragment
(that is selected and copied-then-pasted) can be denoted as BegIntRange and the end of
the code fragment can be denoted as EndIntRange, which defines the range. The case
which CnP supports is when the node is all within the range (in other words, CnP
captures only the nodes that are fully contained within the copied-and-pasted code
fragment), which is case 1 in Figure 4. In this case, a node is captured when the condition
BegIntRange <= StartPos && EndIntRange >= EndPos holds. Copied and pasted source
code that is only partially contained within an AST node is not captured in this
representation (CnP does not capture the node’s contents for cases 2 and 3 in Figure 4,
which is when the node is partly within the range or not within the range at all).
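The containment test can be sketched in plain Java as follows. To keep the example self-contained, it uses a hypothetical Node class that mirrors only the getStartPosition() and getLength() accessors of the Eclipse ASTNode, and it assumes that EndIntRange is the inclusive position of the last copied character. It is a sketch of the condition above, not CnP's source code.

```java
// Sketch of the "fully contained" test for a copied range.
// Assumption: Node is a hypothetical stand-in for the Eclipse ASTNode class.
class RangeCaptureDemo {

    static final class Node {
        private final int start;
        private final int length;

        Node(int start, int length) {
            this.start = start;
            this.length = length;
        }

        int getStartPosition() {
            return start;
        }

        int getLength() {
            return length;
        }
    }

    // True only for case 1: the whole node lies within the copied range.
    // Both range bounds are treated as inclusive character positions.
    static boolean isFullyContained(Node node, int begIntRange, int endIntRange) {
        int startPos = node.getStartPosition();
        int endPos = node.getStartPosition() + node.getLength() - 1;
        return begIntRange <= startPos && endIntRange >= endPos;
    }

    public static void main(String[] args) {
        int beg = 10;
        int end = 29;
        System.out.println(isFullyContained(new Node(12, 5), beg, end));  // node within the range
        System.out.println(isFullyContained(new Node(25, 10), beg, end)); // node partly within the range
        System.out.println(isFullyContained(new Node(40, 3), beg, end));  // node outside the range
    }
}
```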
Figure 4: The three cases when capturing a range of source code using the Eclipse AST API.
Therefore, in general, CnP uses the character offset and length from the source
code to determine a clone’s location in a particular file. The actual source code that is
copied and pasted is represented by the largest continuous set of whole abstract syntax
tree (AST) nodes within the range. Although it is not stated in [231], Codelink probably
also uses offsets, since they use a token-oriented rather than a line-based algorithm for
similarity comparisons between clones. So does CPC. LAPIS represents a text region as a
substring with a start offset and an end offset relative to the start of the file.
Some clone detection tools and clone management tools, for example Clonescape, represent a
clone’s location by the name of the file that contains it together with its line range. The
problem with a line-based representation, however, is that it can give an imprecise
clone boundary because a single line may contain multiple statements. On the other hand,
the character offset representation can pinpoint the exact range of every clone.
CloneTracker was the first to create a way to represent the location of clones
without using a file name with character or line ranges. Instead, CloneTracker uses a
“clone region descriptor (CRD)”, which describes the clone’s relative location in the file
using syntactic, structural, and lexical information (for example, the clone’s alignment
with code blocks). It is possible to use a CRD calculated for a code clone in an early
release to locate the same clone in future releases. However, CRDs may fail to locate
clones when the assumptions that the approach relies on are broken. CnP, by contrast,
continuously updates each clone’s character offset and length as the code is edited, so
its clone locations remain accurate.
Clone Relationship
Much clone-related research, such as [54, 55, 111, 137, 251], including this dissertation,
refers to all similar clones as belonging to a “clone group”. Other research refers to a clone
group as a “region set” [189] or a “clone class” [13, 40, 123]. In all of these cases, the
related clones are treated symmetrically, as equal members of the group.
Clonescape, on the other hand, distinguishes the original as the parent and the duplicated
copy as the child. As a result, clones of the same parent can be called siblings. All related
clones form what its authors call a “clone family”. While it may be useful to know the clone’s
origin for comparison against the pasted code and for clone visualization, the origin
information could and should be separated from the basic clone model.
2.2.2.3. Clone Visualization
Clone visualization can be an effective means to make programmers aware of the
clones in a system.
Markers – Colored Bars and Highlights
The latest version of CnP’s clone visualization feature was improved to
distinguish clone groups (related, similar clones that result from a series of copy and
pastes) by coloring all clones within the same group with the same color of bars. It
distinguishes between the origin and its pastes by slightly darkening the colored bar that
is next to each pasted region. For example, in Figure 5, the origin was the method
“more_variables” (shown in the back), which has a regular shade of yellow for its
visualization bar (since it is the original code fragment that was copied), while its pastes
(the newly modified and related methods “more_arrays” and “more_functions”) are
shown with slightly more grayed versions of the color yellow. These three clone
instances belong to the same clone group, hence they are displayed with variations of the
same color (yellow). A different code fragment that is copied and pasted (belonging to a
different clone group) would be represented with shades of a different color, such as the
color red.
Figure 5: CnP clone visualization distinguishes between clone groups and between the clone
origin and its subsequent pastes.
Visualizing clones is a challenge that all clone-related tools must address.
Similar to CnP, CPC uses colored rulers to show the lines of each clone visually and
CloneTracker marks the lines of clones visually in the sidebar of Eclipse. Codelink
addresses the visualization issue by allowing similar parts of the clones to be hidden from
view (and indicating the commonalities between linked clones in blue and differences in
yellow). CSeR determines or infers each user-made change to clones as an insert, delete,
update, or move, and then highlights each kind of change with a different color.
Unchanged code within a clone is not highlighted. Mouse hover events reveal details
about the change, including what the updated code was before in the original and what
has been deleted from the original. A screenshot of CSeR’s highlights and hover
information is shown in Figure 6.
Figure 6: CSeR shows the changes that would be made to the
ExclusionInclusionDialog class (highlighted code for inserts, deletes, updates,
moves; and hover information for deletes, updates) to make the
SetFilterWizardPage class in SetFilterWizardPage’s file in the Eclipse editor.
The four kinds of user-made differences between related clones, according to CSeR, are:
1. Insert – the addition of an AST (abstract syntax tree) node, highlighted in green.
2. Delete – the removal of an existing AST node, highlighted in red.
3. Update – the modification of an existing AST node, highlighted in yellow.
4. Move – the difference between the matching statements of the clones is that they
have different neighbors, highlighted in blue.
Differencing and Comparison Tools
Some research looks at comparison [153] and its application, including comparing
source code examples [42]. Differencing tools must somehow show the differences
between files visually to the user. Though visualization is still a challenge for these tools,
most display the differences between files very simply, and the main distinguishing
feature among these related tools is the choice of differencing algorithm used.
There are many text-based differencing tools available. Most make use of the diff
algorithm [99, 241] and are based on solving the LCS (Longest Common Subsequence)
problem. Since this approach is developed for text files, it has obvious disadvantages
when used for Java source code [106, 107]. Some differencing tools that are based on the
diff or LCS algorithm include UNIX Diff, Eclipse’s Compare Editor (which can be
invoked by right clicking selected file(s) in Eclipse’s Package Explorer view and then
choosing the “Compare With” menu option), Ldiff [32, 33], and Version Editor (ve) [7]*.
Ve provides tight integration of the revision history and the editor so it has the limitations
and disadvantages of the text-based tools and the version control system.
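As a rough illustration of the LCS formulation that underlies text-based differencing, the following sketch computes the length of the longest common subsequence of two sequences of lines with the standard dynamic-programming table. It is a textbook formulation, not the implementation of any tool named above, and the class and method names are illustrative assumptions.

```java
final class LineLcs {
    /** Length of the longest common subsequence of two arrays of lines. */
    static int lcsLength(String[] a, String[] b) {
        // table[i][j] holds the LCS length of the first i lines of a
        // and the first j lines of b.
        int[][] table = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[a.length][b.length];
    }
}
```

Because whole lines are compared for equality, a line that differs only by a renamed identifier or by whitespace counts as entirely different in this sketch, which hints at why a purely textual comparison can be coarse for source code.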
There are a variety of graph-based differencing algorithms [5, 230, 233] and tools
such as Cdiff [25, 259], Jdiff [5], Semantic Diff [105], and Exas [193]*. The graph-based
approach has an advantage over the text-based tools, which focus only on syntax, since
it takes the program’s semantics into account as well. However, graph-based tools can be slower
and it is not always clear whether the extra analysis pays off.
* http://directory.fsf.org/project/diffutils/, http://help.eclipse.org/help32/topic/org.eclipse.platform.doc.user/reference/ref-25.htm, http://sourceforge.net/projects/ldiff/, http://ix.cs.uoregon.edu/~datkins/ve.html
* http://www.ece.iastate.edu/~nampham/projects/clone/Exas/
Many differencing tools are abstract syntax tree (AST)-based such as LaDiff [37],
Breakaway [41], Jigsaw [43, 44], ChangeDistiller [66], and Coogle [215, 216], including
CSeR [106, 107]*. These tools in general have the advantage of being able to obtain
structured information from the tree-based representation of the source code. CSeR
differs from these tools in terms of its purpose (clone differencing), its interactive and
incremental updating of correspondence rather than re-computing from scratch (in
contrast to what is done in Breakaway [41]), and the heuristics that it uses to infer change
categories (which differs from, for example, those of ChangeDistiller [66]).
Another way of looking at program changes is to use mapping or origin analysis
as part of the differencing algorithm [138, 205] or the tool implementation such as Beagle
[75]. More recently, additional logic has been incorporated as well to get a better
understanding of the changes. The UMLDiff approach tracks the evolution of higher-
level program elements (at the level of UML models) over versions of systems [256, 257,
258] and other research utilizes a novel rule-based and combination algorithm (LSdiff)
[133, 136] to infer regular change patterns and overcome some of the disadvantages of
the other differencing approaches.
* http://lsmr.cs.ucalgary.ca/projects/breakaway/, http://lsmr.cs.ucalgary.ca/projects/jigsaw/, http://www.clarkson.edu/~dhou/projects/CnP/
Capturing Program Structure and Edits
There is a body of research that proposes structure-based editors and semantics-
preserving editing environments [16, 27, 80, 139, 140, 141, 188, 199, 200, 203, 204, 206,
219, 229, 250, 261] rather than traditional text-based editors. These structured editors and
IDEs can benefit programmers by letting them know exactly which edit operations are
being performed; however, these specific, and often stand-alone, editors are not
commonly used in practice. Instead, other research focuses on determining and
presenting structural correspondence [41, 43, 44, 106, 107, 193] to programmers in the
IDEs that they already use, like Eclipse, by utilizing the tree-based representation of the
source code. Rather than bombarding the programmer with too much extra information,
CSeR makes a few general categories of possible user-edits and infers which category a
sequence of edits belongs to incrementally. How to efficiently parse code is a research
problem itself [249]. For better performance, CSeR only compares the smallest
corresponding sub-trees that contain the positions where the programmer last edited.
Capturing Program Changes
… the problem [with software projects] isn’t change, per se, because change is going to happen; the problem, rather, is the inability to cope with change when it comes.
- Kent Beck
Not only is it important to capture the current state of the clones in a system (by
continuously updating clone locations and contents as they are changed), but capturing
change information over time and presenting this to the programmer can be extremely
beneficial. CSeR captures and displays certain clone changes in the editor, which can
help programmers see the level of similarity between the clones better. Seeing update and
deletion information that otherwise is not shown in the file can also be very useful in
learning about the code [186, 187]. Related research in the area of software evolution
looks further into program changes and multi-version programs [31, 37, 66, 77, 129, 131,
133, 134, 136, 154, 205, 262], changeability [177, 178], and evolutionary history [256].
Specifically studying the evolution of clones over multiple versions of the program helps
determine whether these clones require frequent consistent changes or whether they
remain dormant and impose no significant maintenance challenges. It can also pinpoint at
what stage clones are refactored (when they are changed in form) and it can conclude
whether the clones need to be refactored at all [1, 135]. Seeing the code as it has evolved
over time in a version control system instead of just seeing the current version in the
editor can be extremely beneficial in learning how and why the program changes [14].
Using Change Information from Version Control Systems
There is a large body of research that focuses on mining software repositories and
then analyzing the historical information from version control systems, such as SVN
(Subversion) or CVS (Concurrent Versions System), for a variety of reasons [6, 7, 15, 31,
33, 70, 82, 133, 134, 154, 262]. Clone-related tools that use version control system
information include Cleman and ClemanX [194, 195, 196], Clever [197], Clone
Detection Toolbox [190], Clone Smell Extractor [13], and Vaci [119]*. However, this
approach is limited since the information obtained is only from snapshots of when the
program’s source code was checked-in or checked-out and it often requires additional
analysis and inferences to be useful. Furthermore, the program histories may contain a lot
of irrelevant information that is not clone-related. Given the changes between program
versions, people would need to sort through them to find likely copied-and-pasted code and
discard irrelevant information. Also, although people might be able to obtain information about specific
changes made to a particular file, they would not automatically have correspondence
information (between files) from the histories alone.
* http://www.ece.iastate.edu/~nampham/projects/clone/Cleman/, http://www.ece.iastate.edu/~nampham/projects/clever/, http://www.ccfinder.net/vaci.html
Warnings – Error Prevention or Detection
Not only is support for clone management important, but the prevention or
detection of clone-related errors (also called bugs [13, 111, 171, 172, 174],
inconsistencies [111, 171, 172], or anomalies [251]) should also be provided. CnP
contains features that may either prevent errors (like CReN does) or detect potential
errors (warnings) in the tracked clones. CnP issues a warning if any identifier in the
pasted code binds to a declaration in the context where it is pasted (external identifier
scoping) [95, 96, 97]. For example, when a method is copied and pasted within the same
class, CnP can provide a warning for each identifier within the method that is defined at
the class level (outside of the method, but within the class). These warnings will alert the
programmer that these particular identifier instances within the clone (method) may need
to be renamed. This is useful, since it is common for programmers to copy and paste a
code fragment that contains references to external identifiers that are intended only in the
original fragment. The programmer can then use CReN to rename the identifier instances
in the pasted location, if desired.
There are a number of software quality tools [26, 30, 50], including Axivion,
CloneDetective [116], ConQAT, and PMD,* and clone bug detection/prevention tools
such as CP-Miner [171, 172], CPC [251], DECKARD-based tool [111], and FixWizard.*
The famous Alice software prevents syntax errors by providing a drag-and-drop
programming system to aid novice programmers [127], who tend to make errors by
misunderstanding program constructs [260] and breaking implied system rules [58]. Bug
detection (rather than prevention), on the other hand, is often done by finding
inconsistencies between the clones [118, 128, 255] when changes are made [13, 161],
especially inconsistencies in identifiers [111, 171, 172], and spelling errors [45, 98, 192].
* http://www.axivion.com/index-en.html, http://conqat.cs.tum.edu/index.php/CloneDetective, http://conqat.cs.tum.edu/index.php/ConQAT, http://pmd.sourceforge.net/
* http://opera.cs.uiuc.edu/Projects/ARTS/CP-Miner.htm, http://cpc.anetwork.de/, http://wwwcsif.cs.ucdavis.edu/~jiangl/research.html, http://www.ece.iastate.edu/~nampham/projects/fixwizard/
CP-Miner uses identifier mapping such that an identifier is considered consistent
when it always maps to the same identifier (which could be a different name) in the other
fragment, and inconsistent when it maps to multiple identifiers [171, 172]. For
Example 1 in Table 4 of Section 2.3, the identifier “prom_phys_total” in the copied code
fragment maps to both “prom_prom_taken” and “prom_phys_total” in the pasted code.
Because “prom_phys_total” does not map only to “prom_prom_taken” in all instances,
for example, CP-Miner would detect it as an inconsistency. The DECKARD-based tool
claims that an inconsistency exists if the two code fragments contain different numbers of
unique identifiers [111]. For Example 3 in Table 4 of Section 2.3, the DECKARD-based
tool would count two instances of the identifier “l_stride” in the copied code fragment,
but only one instance in the pasted code fragment. Since only one of the two instances of
“l_stride” was renamed to “r_stride” in the pasted code, the DECKARD-based tool would
find this inconsistency. However, both CP-Miner and the DECKARD-based tool
produce false positives, which need to be inspected manually in order to verify the
existence of an actual bug. Similarly, the clone smell detection tool [13] also requires
human intervention to determine if the detected “unusual” changes are, in fact, bugs.
Instead of being retroactive in terms of bug prevention and detection, CnP provides a
form of automatic bug prevention (with its CReN and LexId renaming tools) and can give
warnings on code as it is being edited.
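The identifier-mapping idea attributed to CP-Miner above can be sketched as follows. This is a deliberately simplified illustration and not CP-Miner's actual algorithm: it assumes that the identifiers of the copied and pasted fragments have already been extracted into two position-aligned lists of equal length, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IdentifierMapping {
    /**
     * Returns the identifiers of the copied fragment that map to more than
     * one identifier in the pasted fragment (assumed position-aligned).
     */
    static List<String> inconsistent(List<String> copied, List<String> pasted) {
        if (copied.size() != pasted.size()) {
            throw new IllegalArgumentException("fragments must be position-aligned");
        }
        // For each copied identifier, collect every identifier it maps to.
        Map<String, Set<String>> targets = new LinkedHashMap<>();
        for (int i = 0; i < copied.size(); i++) {
            targets.computeIfAbsent(copied.get(i), k -> new LinkedHashSet<>())
                   .add(pasted.get(i));
        }
        // An identifier is inconsistent if it maps to more than one target.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : targets.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With the identifiers from Example 1, a copied fragment that uses “prom_phys_total” twice and a pasted fragment that uses “prom_prom_taken” and “prom_phys_total” in the corresponding positions would be reported as inconsistent, while an identifier that is renamed in every position would not.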
Alerts – Clone Modification Notification
Clone modification notification is a new feature found in clone-related tools.
Clonescape alerts programmers when they edit a clone by showing a red status line
message. CloneTracker uses notifications to alert programmers when tracked clones are
being modified (for example, so that they can choose to turn on the simultaneous editing
feature). CPC uses notifications to warn the programmer about possible update
anomalies. Clones can be marked as “ignored”, meaning that no more notifications will
be generated for this particular clone. CnP lets the programmer know via visualization
that a clone is being or was edited (boxes with CReN/LexId, and highlights with CSeR).
Views and Graphs
Research in the area of clone detection visualizes clones with graphs [83, 114,
232, 228] and views [14, 208, 228]. CnP provides two views: one view to list the clone
detection tool results that are reported and one view to list the clones being tracked by
CnP [96]. Clones that are being tracked by CnP can be either clones that have been
automatically tracked since they were copied and pasted in the IDE or clones that started
being tracked after they were manually imported from the clone detection tool. The
LAPIS editor suggests three possible views for future work, including a bird’s-eye view,
an abbreviated context view, and an “unusual matches” view. CloneTracker uses a view
to list clones and clone groups. Clonescape proposes a multi-view approach, where only
the one or two views of interest are automatically shown to the programmer at one time, a
technique known as fisheye view, but these are unimplemented. CPC contains a few main
views, including a clone list view, tree clone view, and a clone replay view. The use of
graphs and views, like markers, is an issue that all clone-related tools face. The challenge
is to find an alternative to the separate views that programmers need to invoke and the
relatively complex graphs that they need to learn and understand.
2.2.2.4. Clone Persistence
While all software tools make use of data structures (such as vectors and maps)
that store information in the system’s memory while the tool is currently being run, this
information must be recorded in some way so that it can be accessed and updated when
the programmer works on the source code again at a later time. Storing the clone
information between programming sessions is what is called clone persistence.
CnP persists the information about the tracked clones between sessions in a flat
database (simple text file). Specifically, it stores each clone’s location (the file name that
contains the clone with the clone’s starting character position in the file and its length in
number of characters) within each clone group. In addition, as part of the information
needed for consistent identifier renaming within code fragments, each identifier’s
location (the identifier’s starting character position in the file and its length in number of
characters) within each identifier group is also stored. The information gets saved
automatically whenever Eclipse quits, and loaded automatically when Eclipse starts up.
This single file covers the whole workspace, not just individual projects.
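To make the idea of a flat-file clone store concrete, the sketch below writes and reads one clone location per line. The text above does not specify CnP's file layout, so the tab-separated format (group, file name, start position, length) and the CloneRecord class are purely assumptions for illustration; file names containing tab characters are not handled.

```java
/** Hypothetical record for one tracked clone location. */
final class CloneRecord {
    final int groupId;      // clone group that the clone belongs to
    final String fileName;  // file that contains the clone
    final int start;        // starting character position in the file
    final int length;       // length in number of characters

    CloneRecord(int groupId, String fileName, int start, int length) {
        this.groupId = groupId;
        this.fileName = fileName;
        this.start = start;
        this.length = length;
    }

    /** Serializes the record as one tab-separated line (assumed format). */
    String toLine() {
        return groupId + "\t" + fileName + "\t" + start + "\t" + length;
    }

    /** Parses a line produced by toLine(). */
    static CloneRecord fromLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        return new CloneRecord(Integer.parseInt(fields[0]), fields[1],
                Integer.parseInt(fields[2]), Integer.parseInt(fields[3]));
    }
}
```

Identifier locations within each identifier group could be stored in the same way, with an identifier group number in place of the clone group number.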
CPC also persists clone information. Codelink saves the links between clones as
file meta-data, making the links persistent between sessions. However, the persistent
links are not robust to edits. The latest version of CloneTracker persists the clone
information that it tracks for the current project. A unique feature of CPC is that it also
gathers information about the copying and pasting activities in general, and it persists the
full modification history of each clone in relation to its clone group.
2.2.2.5. Clone Documentation and Clone Attributes
Some people believe that clone tracking and visualization act as a form of source
code documentation by themselves. Though many tools claim to “document” the clones
that they are tracking or managing in this way, clone documentation is actually defined as
support for additional information to be written about the clone (which forms the clone’s
external attributes). Clone documentation, such as why the clone was created (for
example, for hardware variation or a bug workaround [38]), generally cannot be retrieved
by the system and must be added by the programmer. Clonescape and CPC define “clone
classification”; however, their approaches include documenting the structural information
about clones, which does not fully fit into the previous definition. Similarly, other
research in the topic of clone classification [121, 123, 124] groups clones by the region
that they occur in (the level of abstraction involved and the location of the clones in a
file), which also does not require user intervention. Instead, the “reasoning” type of clone
classification described by the authors of Clonescape [38] is consistent with the definition
provided here. In addition, the clone attribute “severity” that is set at low, medium, or
high by the programmer depending on whether the clone should be removed from the
system, is an example of a resulting clone attribute according to the above definition.
2.2.3. Clone Lifecycle Support
Proactive clone management must be actively done at all times during software
development and maintenance, throughout a clone’s lifecycle. Designing CnP and
reviewing related work revealed various definitions of clone properties (in the previous section)
and a variety of tool support (and lack of tool support) for each stage of the clone
lifecycle (in this section). In this dissertation, the clone lifecycle is explicitly
defined (shown in Figure 7) from clone creation, through clone capture and clone editing,
to clone extinction. The following subsections present a variety of tool support for the
phases of a clone’s life. Table 2 (on a following page) summarizes the design and
implementation details for each of the related clone tracking tools: Clonescape [38], CPC
[251], Codelink [231], LAPIS [189], and CloneTracker [54, 55], including CnP [95, 96,
97, 100, 101, 102] (and its parts: CReN consistent identifier renaming [103], LexId
consistent substring renaming [104], and CSeR clone comparison [106, 107]), and it
specifically highlights the problems that the related tools did not address that CnP does.
The emphasis of these six tools, in particular, is in supporting the editing phase of the
lifecycle to avoid inconsistent modifications to clones.
Figure 7: The Clone Lifecycle – Clone Creation, Clone Capture, Clone Editing, and
Clone Extinction.
Table 2: Summary of Clone Tracking Tools with their Clone Lifecycle Support
2.2.3.1. Clone Creation
When the concept of clone creation is considered, two questions come to mind:
how were the clones created, and why were the clones created? The answers to these two
questions determine whether or not that particular clone is tracked and supported by the
clone tracking tool.
How were the clones created?
Code clones can be created in a number of ways [115], but many, if not most,
clones are undoubtedly created via copying and pasting, since duplication is very easy to
do with either a simple menu selection (Edit - Copy, Edit - Paste) or a keyboard shortcut
(Ctrl+C, Ctrl+V). As a result, the software tool CnP, which supports copy-and-paste-
induced code clones upon creation, essentially captures one of the most common kinds of
clones made and it guarantees 100% accuracy in clone “detection”, since the copying and
pasting is known exactly as it happens.
Why were the clones created?
A key distinction between clones is also the reason for the clone creation. Existing
research distinguishes between intentional clones (code that the programmer intended to
reuse) [212, 213] and accidental clones (code that is similar due to a protocol
requirement) [2]. This dissertation and most other clone-related research focus on
intentional clones,
but realize that accidental clones do exist. To address accidental clones, tools often allow
some form of user control such as allowing the programmer to remove certain clones
from those that are automatically being tracked.
One research paper has categorized the high-level intentions of copy-and-paste as:
to relocate, regroup, reorganize code, to reorder code, to reformat code, to help remember
a long, complicated name, to restructure or refactor code manually, and to reuse code as a
structural template (either a syntactic template or a semantic template) [130]. Similarly,
another set of research publications has introduced a categorization of eight cloning
patterns into three groups: forking, templating, and customization [122, 125]. Forking,
which involves duplicating a large portion of code to be evolved independently, includes
hardware variations, platform variations, and experimental variation. Templating
involves the direct copying of existing code, where the appropriate abstraction is
unavailable. Cloning patterns in this group include “boiler-plating” due to language
inexpressiveness, API/library protocols, and general language or algorithmic idioms.
Finally, customization involves copying existing code that solves a similar problem to the
current problem and modifying it accordingly. It includes cloning patterns such as bug
workarounds and “replicate and specialize”. CnP focuses on the last kind of cloning,
which is again reported as “the most common type of cloning” [122, 125].
2.2.3.2. Clone Capture
Clone capture refers to how and when clones are detected in the source code in
order to be tracked by the tool. Proactive clone capturing detects a clone upon creation,
while retroactive capturing often detects a clone later on in its life. It is important to have
both proactive and retroactive support, since the proactive approach only catches newly
created clones, while the retroactive method can detect clones that were made before the
proactive tool was applied.
Tracking Copy-and-Paste Actions (Proactive)
Clone tracking happens behind-the-scenes, keeping track of the clone and clone
group relationships and their locations within the source code files. CnP automatically
tracks only significant copy and pastes that contain certain program elements or number
of lines of code (based on a configurable policy). The tool tracks code fragments at the
granularity of a character, to the nearest contained abstract syntax tree (AST) node. The
AST representation of the source code is accessible within the Eclipse framework.
By default, the CnP tool automatically detects the creation of new clones that are
made via copying and pasting, and it was the first known published tool to do this in the
IDE. CnP is able to track each clone from its beginning to end of life, unlike retroactive
tools which do not capture a clone’s creation, but instead rely on clone detection tools or
user selection and only start tracking clones later in their lives. After creation, CnP tracks
clones by storing information about them. The location and range of a clone is
continuously updated as the code changes.
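One way to picture this continuous updating is to adjust a clone's start position and length whenever text is inserted into the file. The sketch below is an assumption about how such bookkeeping could work, not a description of CnP's code; in particular, the treatment of an insertion exactly at a clone boundary is a choice made here for illustration, and the TrackedClone class is hypothetical.

```java
/** Hypothetical tracked clone: a character offset and a length in one file. */
final class TrackedClone {
    int start;   // starting character position of the clone
    int length;  // length of the clone in characters

    TrackedClone(int start, int length) {
        this.start = start;
        this.length = length;
    }

    /** Adjusts the clone after count characters are inserted at offset. */
    void onInsert(int offset, int count) {
        if (offset <= start) {
            // Text inserted before the clone shifts it to the right.
            start += count;
        } else if (offset < start + length) {
            // Text inserted inside the clone makes it longer.
            length += count;
        }
        // Text inserted after the clone leaves it unchanged.
    }
}
```

Deletions could be handled symmetrically by shrinking or shifting the range.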
For the tools that focus on tracking copy-and-paste-induced clones as they are
made by copy-and-paste (CnP, Clonescape, and CPC), it is important to note that still not
all of these copied code fragments are necessarily clones of interest to be tracked.
Research on copy-and-paste usage patterns shows that programmers often copy very
small pieces of code (such as a variable name, a type name, or a method name) [132]. To
avoid tracking the clones produced when the programmer is simply copying and pasting
less than a single line of code to save typing or to remember a name’s spelling, CnP
chooses to automatically track only clones that would appear to be more significant.
Under the policy that CnP currently uses, a copied code fragment is considered
a clone suitable for tracking if and only if it contains:
1. more than two statements, or
2. at least one conditional statement, loop statement, or method, or
3. a type definition (class or interface).
This policy can be made configurable. On top of this basic filtering, programmers are
also given the option to take a clone out of a clone group. Furthermore, if a paste is done
within an already tracked clone, CnP treats this new paste the same as an insert operation,
which just grows the existing clone, instead of tracking “clones within clones”.
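The three-part tracking policy listed above can be expressed as a single predicate. In this sketch the FragmentSummary class is a hypothetical summary of what an AST traversal of the copied fragment might report; how CnP actually gathers these facts is not described here.

```java
/** Hypothetical summary of the AST nodes found in a copied fragment. */
final class FragmentSummary {
    final int statementCount;                  // number of statements
    final boolean hasConditionalLoopOrMethod;  // any conditional, loop, or method
    final boolean hasTypeDefinition;           // any class or interface

    FragmentSummary(int statementCount, boolean hasConditionalLoopOrMethod,
            boolean hasTypeDefinition) {
        this.statementCount = statementCount;
        this.hasConditionalLoopOrMethod = hasConditionalLoopOrMethod;
        this.hasTypeDefinition = hasTypeDefinition;
    }
}

final class TrackingPolicy {
    /** True if the fragment meets at least one of the three criteria. */
    static boolean shouldTrack(FragmentSummary fragment) {
        return fragment.statementCount > 2
                || fragment.hasConditionalLoopOrMethod
                || fragment.hasTypeDefinition;
    }
}
```

A copied variable name or a single statement fails all three tests and is therefore not tracked, while a copied loop or method qualifies regardless of its size.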
Clonescape and CPC are both token-based (rather than AST-based), which has its
limitations (mentioned previously). Clonescape’s policy for tracking more significant
clones is to track clones of at least 30 tokens. CPC’s policy is to track clones above a
programmer-specified minimal clone length in terms of characters, tokens, or lines. Both
of these clone tracking policies are limited in that generally clones are not significant
based on numerical size alone (different programming languages and styles may allow a
significant amount of code to be written in a relatively small number of characters,
tokens, or lines). The presence of multiple statements or an abstraction in the copied code
is more important than its actual size and can indicate that the code is, in fact, worth
tracking, since it is more likely to have been copied in order for the statements or
abstraction to be reused.
None of Codelink, LAPIS, and CloneTracker tracks copy and paste actions; they
instead rely on importing from clone detection tools and/or the selection of clones.
Importing from Clone Detection Tools and/or Selecting Clones (Retroactive)
In addition to the proactive support of tracking clones since their creation, CnP
also includes retroactive support, specifically importing clones that were detected with
the SimScan clone detection tool. After SimScan is run, its reported clones can be
displayed in a view in CnP. Since these imported clones have the limitations of the clone
detection tool, each clone’s location is listed by the name of the file that contains it and its
position within that file (in line range notation). The final stage of the importing process
is that the programmer selects which of the reported clone groups to import into CnP. As
a result, only intentional and significant clones should be selected, while others should be
weeded out. SimScan’s representation of the selected clones is converted into CnP’s
format and then these clones can start being tracked by CnP. In addition to their file
locations and positions, native clones (those that have been tracked by CnP since their
creation via copy and paste) also have identifier information associated with them.
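Converting a line-range location into a character offset and length can be sketched as below. This is an illustrative assumption about the conversion step, not SimScan's or CnP's code: line numbers are taken to be 1-based and inclusive, lines are assumed to be separated by '\n', and the LineRangeConverter class is a hypothetical name.

```java
final class LineRangeConverter {
    /**
     * Returns {offset, length} for lines startLine..endLine of text.
     * Line numbers are 1-based and inclusive; the line terminator of the
     * last line is not included in the length.
     */
    static int[] toOffsetAndLength(String text, int startLine, int endLine) {
        if (startLine < 1 || endLine < startLine) {
            throw new IllegalArgumentException("invalid line range");
        }
        int line = 1;
        int start = -1;
        int pos = 0;
        while (pos <= text.length()) {
            int lineEnd = text.indexOf('\n', pos);
            if (lineEnd < 0) {
                lineEnd = text.length();  // last line has no terminator
            }
            if (line == startLine) {
                start = pos;
            }
            if (line == endLine) {
                return new int[] { start, lineEnd - start };
            }
            pos = lineEnd + 1;
            line++;
        }
        throw new IllegalArgumentException("line range exceeds file");
    }
}
```

Because the result covers whole lines, it inherits the imprecision of the line-based notation: any other statements that share the first or last line are included in the range.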
CloneTracker supports clone importing from SimScan and then relies on the
programmer’s input in clone selection. The retroactive portion of Clonescape, although
not completely implemented, also chooses to rely on the programmer’s selection of
clones from the clone detection tool’s results. CPC, on the other hand, just marks the
reported clones differently than the copy-and-paste-induced clones in the system, without
requiring the programmer’s selection. However, CPC still requires all clones, including the
imported ones, to go through an automated filtering process. In Codelink and LAPIS,
clones are just selected manually by the programmer (no clone detection tool is used).
Though this eliminates some tool processing overhead, the extra burden is put on the
programmer to know which clones to select and where they are in the system.
2.2.3.3. Clone Editing
Clone editing is broadly construed as any feature that supports the editing of clone
code and the maintenance of the clone model. Program and clone modification is one of
the most cited problems of cloning, since it can result in errors and inconsistencies, so
many people choose to refactor the clones in order to avoid having to maintain them.
However, a new research perspective proposes to manage the edits to clones rather than
“solving” the maintenance problem by removing the clones, especially since some clones
should or must remain in the system. This section summarizes source code editing
techniques that make consistent edits within a clone (intra-clone editing) and between
clones at the same time (inter-clone editing).
Inter-clone editing applies when the physical change being made to one clone
is the same physical change needed in all related clones (similarity). On the other
hand, for intra-clone editing, only the relationship is the same between the clones, not the
physical change itself, for example, in the case of a local identifier name change
(analogy).
Inter-Clone Editing (Between Clones)
    The fundamental problem with program
    maintenance is that fixing a defect has a
    substantial (20-50 percent) chance of introducing
    another. So the whole process is two steps
    forward and one step back…
    - Frederick P. Brooks, Jr.
There are times when a change needs to be made consistently between clones,
such as a new feature or a bug fix. A bug that was replicated via copy
and paste must be fixed in all related code fragments; otherwise, the
unchanged fragments (“rogue tiles” [245]) introduce a new inconsistency or
bug into the system in addition to the existing bug.
There have been a number of editing techniques proposed by researchers to help
maintain common updates and changes between similar code fragments (linked editing
with Codelink [231], synchronous editing [122], and simultaneous editing with LAPIS
and CloneTracker [54, 55, 189]). Inter-clone editing provides the same benefit of
updating in one place that an abstraction would provide, but allows the clones to remain
in the system. Each inter-clone editing technique has been implemented as either an add-
on tool (that integrates into an existing editor or IDE) or as its own system. However,
each implementation requires that the code be manually selected by the programmer first.
All tools allow the programmer to “undo” or correct the automatic edits if necessary.
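The idea shared by these techniques can be sketched in a few lines of Java. This is only an illustration: the class and method names (LinkedClones, link, replaceInAll) are hypothetical and are not the API of Codelink, LAPIS, or CloneTracker, and the sketch assumes that all linked clones have identical text, so that an edit at a relative offset applies unchanged to every copy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of inter-clone (linked) editing: one edit is replayed in every linked clone. */
final class LinkedClones {
    private final List<StringBuilder> clones = new ArrayList<>();

    /** Links a code fragment into the group (selected manually, as in the tools discussed above). */
    void link(String fragment) {
        clones.add(new StringBuilder(fragment));
    }

    /** Replaces the text in [start, end) of every linked clone; offsets are relative to each clone. */
    void replaceInAll(int start, int end, String replacement) {
        for (StringBuilder clone : clones) {
            clone.replace(start, end, replacement);
        }
    }

    String text(int index) {
        return clones.get(index).toString();
    }

    public static void main(String[] args) {
        LinkedClones group = new LinkedClones();
        group.link("if (x < 0) return;");
        group.link("if (x < 0) return;");
        group.replaceInAll(6, 7, "<="); // fix the operator once; both copies change
        System.out.println(group.text(0));
        System.out.println(group.text(1));
    }
}
```

A real tool would also have to handle clones that differ slightly from one another, which this sketch deliberately ignores.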
Although not currently available, support for inter-clone editing
can be added to CnP as an additional way to ensure consistent clone modifications across
the whole system.
Intra-Clone Editing (Within Clones)
    If you were to break up a program, put it into a
    grinder, and then sort the pieces, you would find
    that the bulk of the program is in names.
    - Charles Simonyi
A common modification that a programmer makes is not between clones, but
rather within a clone, for example, to modify pasted code to fit the current task. Research
recognizes that a common type of clone is a parameterized one [9, 12, 47], where the
programmer intends to make only small changes to the pasted code, such as to the identifier
names and literal constants. CReN and LexId were the first known tools to aid
programmers in consistently renaming identifier and substring instances within a clone,
which is called intra-clone editing.
CReN
CnP contains a renaming utility (CReN) [103] that helps rename identifiers
consistently within clones, shown in Figure 8 (and within any specified code fragment,
shown in the bottom of Figure 19). CReN uses a heuristic that identifiers referring to the
same program element within a code fragment should be renamed together consistently.
In this way, CReN helps prevent inconsistent renaming errors because manual renaming
can miss an instance that was intended to be renamed. When the declaration is outside of
the fragment, the missed name can still be accepted by the compiler (since it is still
in scope), so programmers would not normally be alerted to the missed instance. The
tool’s automatic actions can be corrected by the programmer. This novel renaming
feature is unique to the CnP tool and is one way that it supports clone-aware editing
differently than other tools.
As a simple example, suppose that the programmer needs to make a method to
find the range of an integer array. Code has already been written to find the lowest
integer in the array of integers, shown as the first for loop in Figure 8, so the programmer
would need to write code to find the highest integer in the array. [In Figure 8, the original
code is colored with a red bar beside it, and the newly pasted code is colored with a blue
bar beside it. This screenshot was made before CnP’s visualization was improved to
display related clones from a sequence of copy and pastes in the same shade of color.]
With the existing loop and the variables “low” and “i” already present, the programmer
could add in the new variable “j” and then copy and paste the declaration of “low” and
the for loop together, creating an exact duplicate. CReN highlights the copied code (the
origin) with a red bar and the pasted code with a blue bar. After changing the “<”
operator to “>”, the programmer would then like to rename all instances of “low” to
“high” in the pasted code and all instances of “i” to “j” in this loop. Manual renaming can
miss an instance that is undetected by the compiler. In Figure 8, with CReN, all instances
of “i” in the pasted loop are renamed to “j” when any “i” in the loop is renamed by the
programmer.
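The scenario can be written out as a small Java sketch. The method names (range, rangeWithMissedRename) are ours and do not appear in Figure 8. The second method shows one way a manual rename can go wrong: a single “low” in the pasted loop was not changed to “high”, yet the code still compiles because “low” is in scope.

```java
/** Illustrates the copy, paste, and rename scenario; method names are hypothetical. */
final class RangeExample {

    /** Pasted loop renamed consistently: every "low" became "high" and every "i" became "j". */
    static int range(int[] array, int size) {
        int i, j;
        int low = array[0];
        for (i = 1; i < size; i++) {
            if (array[i] < low) {
                low = array[i];
            }
        }
        int high = array[0];
        for (j = 1; j < size; j++) {
            if (array[j] > high) {
                high = array[j];
            }
        }
        return high - low;
    }

    /** Same paste, but one "low" in the condition was missed; the compiler does not complain. */
    static int rangeWithMissedRename(int[] array, int size) {
        int i, j;
        int low = array[0];
        for (i = 1; i < size; i++) {
            if (array[i] < low) {
                low = array[i];
            }
        }
        int high = array[0];
        for (j = 1; j < size; j++) {
            if (array[j] > low) { // should compare against "high"
                high = array[j];
            }
        }
        return high - low;
    }

    public static void main(String[] args) {
        int[] data = {3, 9, 1, 4};
        System.out.println(range(data, data.length));
        System.out.println(rangeWithMissedRename(data, data.length));
    }
}
```

For the array {3, 9, 1, 4}, the consistent version returns 8, while the version with the missed rename returns 3, with no diagnostic from the compiler.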
Programmers may implement the example in Figure 8 differently (by using a
single for loop instead of two loops). The choice of implementation depends on the
programmer’s style. Some programmers may prefer the implementation outlined here,
since it demonstrates a clear separation of concerns of finding “low” and “high”. The
solution is presented here in this way as a simple illustration to follow that does not
require much explanation.
Figure 8: Consistent identifier renaming within a clone using CReN.
To know which identifier instances in a code fragment are to be renamed
consistently together, CReN groups together instances that bind to the same program
element. If bindings are not available, CReN will simply rename identifier instances
within the clone that have the same name. Even though CReN only renames identifier
instances consistently within a single clone, CReN must still also maintain relationships
between it and the other clones in the same clone group. This additional mapping is used
when information needs to be updated across all clones, such as when the programmer
chooses to use CReN’s “side-stepping” feature to remove an identifier from the group of
instances that are to be renamed together. In this situation, when the programmer
removes a particular instance in the clone that is currently being edited, CReN would also
exclude that corresponding identifier instance from being renamed with the others in all
related clones, as shown in Figure 9.
Figure 9: The programmer can choose to rename an instance separately from the
others (notice that one “i” in the pasted loop on line 33 is not being renamed as a “j”
with the others anymore).
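A minimal sketch of this behavior, assuming the name-based fallback (no bindings) and a single clone represented as a list of tokens, is given below. The class FragmentRenamer and its methods are hypothetical and are not taken from CReN's implementation; in particular, the propagation of a side-stepped instance to the related clones in the group is not modelled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of renaming within one fragment, with "side-stepped" instances excluded. */
final class FragmentRenamer {
    private final List<String> tokens;
    private final Set<Integer> sideStepped = new HashSet<>();

    FragmentRenamer(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    /** Removes the instance at the given token position from the group that is renamed together. */
    void sideStep(int tokenIndex) {
        sideStepped.add(tokenIndex);
    }

    /** Renames every instance with the same name inside the fragment, except side-stepped ones. */
    void rename(String oldName, String newName) {
        for (int k = 0; k < tokens.size(); k++) {
            if (tokens.get(k).equals(oldName) && !sideStepped.contains(k)) {
                tokens.set(k, newName);
            }
        }
    }

    String text() {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        FragmentRenamer loop = new FragmentRenamer(java.util.Arrays.asList(
                "for", "(", "i", "=", "1", ";", "i", "<", "size", ";", "i", "++", ")"));
        loop.sideStep(6);      // keep the "i" in the loop condition as it is
        loop.rename("i", "j"); // all other instances of "i" change together
        System.out.println(loop.text());
    }
}
```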
Grouping Identifiers
Figure 10 shows an example of an abstract syntax tree (AST) for the for loop:
    for(i = 1; i < size; i++)
    {
        if(array[i] < low){
            low = array[i];
        }
    }
All nodes have a type and some nodes have values. To read the diagram with the source
code, start with the root node. The root node is of type ForStatement and corresponds to
the “for” part of the source code. This node is the highest node in this sub-tree (this is a
partial AST, since this for loop must be just a part of a whole, larger source code file) just
as the for loop contains all of its parameters and statements within it. In this example,
identifiers, tokens (literal constants), and operators have values. For example, a node of
type Assignment has an operator value of ‘=’. The first parameter of the for loop “i = 1”
is its own sub-tree with its root as the = operator. The identifier ‘i’ is the left branch of the
sub-tree in a node of type SimpleName. The token ‘1’ is the right branch of the sub-tree
in a node of type NumberLiteral. And so on, for the rest of the parts of the for loop.
Notice that all identifiers and tokens are leaves of the tree. To gather groups of identifiers
CnP “visits” (using the visitor design pattern) nodes of type SimpleName and stores
instances of the same identifier together. The for loop in this example has four identifier
groups, each displayed in a different color: the “i” instances are all colored pink, the
“size” instance is colored purple, the “array” instances are colored green, and the “low”
identifier instances are colored blue. The grouping of identifiers is used in CReN and
LexId, in particular, for the consistent renaming of all instances of the same identifier.
Figure 10: The abstract syntax tree (AST) of a for loop with the identifier groups highlighted.
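The grouping step can be illustrated with a toy AST and a visitor. The sketch below is self-contained and is not CnP's code: AstNode, NodeVisitor, IdentifierGrouper, and AstDemo are hypothetical names, and only the node type names mentioned in the text (ForStatement, Assignment, SimpleName, NumberLiteral) are taken from the description above; the remaining type names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Visitor interface for the toy AST below. */
interface NodeVisitor {
    void visit(AstNode node);
}

/** A toy AST node with a type, an optional value, and children. */
final class AstNode {
    final String type;  // e.g. "ForStatement", "SimpleName", "NumberLiteral"
    final String value; // identifier, literal, or operator; null if the node has none
    final List<AstNode> children = new ArrayList<>();

    AstNode(String type, String value, AstNode... kids) {
        this.type = type;
        this.value = value;
        for (AstNode kid : kids) {
            children.add(kid);
        }
    }

    /** Pre-order traversal: visit this node, then its children from left to right. */
    void accept(NodeVisitor visitor) {
        visitor.visit(this);
        for (AstNode child : children) {
            child.accept(visitor);
        }
    }
}

/** Collects SimpleName nodes and stores instances of the same identifier together. */
final class IdentifierGrouper implements NodeVisitor {
    final Map<String, List<AstNode>> groups = new LinkedHashMap<>();

    @Override
    public void visit(AstNode node) {
        if ("SimpleName".equals(node.type)) {
            groups.computeIfAbsent(node.value, key -> new ArrayList<>()).add(node);
        }
    }
}

final class AstDemo {
    static AstNode name(String identifier) {
        return new AstNode("SimpleName", identifier);
    }

    /** Builds the tree for: for(i = 1; i < size; i++) { if(array[i] < low){ low = array[i]; } } */
    static AstNode sampleLoop() {
        AstNode init = new AstNode("Assignment", "=", name("i"), new AstNode("NumberLiteral", "1"));
        AstNode cond = new AstNode("InfixExpression", "<", name("i"), name("size"));
        AstNode update = new AstNode("PostfixExpression", "++", name("i"));
        AstNode test = new AstNode("InfixExpression", "<",
                new AstNode("ArrayAccess", null, name("array"), name("i")), name("low"));
        AstNode assign = new AstNode("Assignment", "=", name("low"),
                new AstNode("ArrayAccess", null, name("array"), name("i")));
        AstNode body = new AstNode("Block", null, new AstNode("IfStatement", null, test, assign));
        return new AstNode("ForStatement", null, init, cond, update, body);
    }

    public static void main(String[] args) {
        IdentifierGrouper grouper = new IdentifierGrouper();
        sampleLoop().accept(grouper);
        grouper.groups.forEach((id, nodes) -> System.out.println(id + ": " + nodes.size()));
    }
}
```

Running the visitor over the sample loop yields the four groups named in the text (“i”, “size”, “array”, and “low”).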
LexId
LexId [104] is an extension of CReN that renames parts of identifier names
consistently together within code fragments. All instances of a common substring
between all identifiers (which can be different whole identifiers) within a clone are
renamed together as one of them is renamed. LexId handles a different use case than
CReN and instead focuses on inferring the lexical patterns across different identifiers.
LexId determines substrings as those parts of an identifier that are separated by an
underscore “_” or dollar sign “$” character, or by changes in character type (digits or
letters) or case (uppercase or lowercase letters). The “_” and “$” are never substrings or
part of a substring (they are strictly only separation characters). Table 3 shows some
examples of how LexId’s algorithm divides up an identifier into substrings. The standard
Java CamelCase naming convention is supported. LexId also properly divides C-style
identifiers as expected with respect to capitalization and digits and even supports
Hungarian notation [220].
Table 3: Examples of what LexId considers to be substrings.
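The splitting rules described above can be approximated by the following sketch. It is our reading of the prose, not LexId's actual algorithm: the class SubstringSplitter is hypothetical, and the treatment of case changes (a boundary before an uppercase letter that follows a lowercase letter, and before the last letter of an uppercase run that is followed by a lowercase letter) is an assumption chosen so that CamelCase names split into whole words.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter; the boundary rules are assumptions based on the description in the text. */
final class SubstringSplitter {

    /** Splits an identifier into substrings; "_" and "$" only separate substrings and are dropped. */
    static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int k = 0; k < identifier.length(); k++) {
            char c = identifier.charAt(k);
            if (c == '_' || c == '$') {
                flush(current, parts);
                continue;
            }
            if (current.length() > 0 && isBoundary(identifier, k)) {
                flush(current, parts);
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    /** Decides whether a new substring starts at position k (k > 0, previous char is not a separator). */
    private static boolean isBoundary(String id, int k) {
        char prev = id.charAt(k - 1);
        char c = id.charAt(k);
        if (Character.isDigit(prev) != Character.isDigit(c)) {
            return true; // change between letters and digits
        }
        if (Character.isLowerCase(prev) && Character.isUpperCase(c)) {
            return true; // lowercase followed by uppercase, as in "leftButton"
        }
        // end of an uppercase run that is followed by a capitalized word, as in "HTMLParser"
        return Character.isUpperCase(prev) && Character.isUpperCase(c)
                && k + 1 < id.length() && Character.isLowerCase(id.charAt(k + 1));
    }

    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        for (String id : new String[] {"leftButton", "prom_phys_total", "a1", "MAX_VALUE"}) {
            System.out.println(id + " -> " + split(id));
        }
    }
}
```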
Inferring Patterns in Identifiers
Programmers can copy and paste when following a certain naming scheme. In this
situation, a substring of the identifier in the pasted code remains the same as in the existing
instances in the copied code, while the modified portion may follow a convention or
standard. For example, in GUI programming, symmetric naming conventions like
“leftButton” and “rightButton”, or “topPanel” and “bottomPanel”, are often used. In this
scenario, LexId can be made to maintain a database of the naming pairs (left/right,
top/bottom, etc.), which it could use to infer the other name in the pair when the code is
copied and pasted, as in Figure 11. Currently LexId does not yet infer the other part of the
pair, but it does rename the same substring instances together as one is being edited.
Figure 11: LexId changes the substrings “left” to “right” when one is edited. In the
future, LexId can be made to automatically infer the substring “right” in the pasted
code based on “left” by maintaining a database of common naming pairs.
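The proposed database of naming pairs could look like the sketch below. Since LexId does not yet implement this inference, everything here is hypothetical: the class NamingPairs, its methods, and the rule that a pair member only matches as a leading substring followed by an uppercase letter (or the end of the identifier) are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical database of symmetric naming pairs such as left/right and top/bottom. */
final class NamingPairs {
    private final Map<String, String> counterpart = new HashMap<>();

    /** Registers a pair in both directions. */
    void addPair(String first, String second) {
        counterpart.put(first, second);
        counterpart.put(second, first);
    }

    /**
     * Returns the identifier with its leading substring swapped for its counterpart,
     * or null if no registered pair applies.
     */
    String inferCounterpart(String identifier) {
        for (Map.Entry<String, String> entry : counterpart.entrySet()) {
            String prefix = entry.getKey();
            if (identifier.startsWith(prefix)) {
                String rest = identifier.substring(prefix.length());
                // Require a substring boundary so that "leftover" is not treated as "left" + "over".
                if (rest.isEmpty() || Character.isUpperCase(rest.charAt(0))) {
                    return entry.getValue() + rest;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        NamingPairs pairs = new NamingPairs();
        pairs.addPair("left", "right");
        pairs.addPair("top", "bottom");
        System.out.println(pairs.inferCounterpart("leftButton"));
        System.out.println(pairs.inferCounterpart("bottomPanel"));
    }
}
```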
Another example of a naming pattern in identifiers is using numbers in the
identifier name. In this situation, the 1, 2, etc. are characters of a string, but can be treated
separately from the remaining part of the identifier name. In Figure 12, the programmer
wishes to change the “a1” and “a2” to “x1” and “x2” in the pasted code, and “b1” and
“b2” to “y1” and “y2” in the pasted code. To rename both substrings of “a” to “x”
together and both substrings of “b” to “y” together requires LexId to separate the whole
identifier into separate editable substrings that are grouped together according to the
inference.
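The difference between renaming a substring and a plain textual replacement can be sketched as follows. The class SubstringRenamer and the boundary expression are hypothetical and simplified (separators, letter/digit changes, and lowercase-to-uppercase changes only); they are not LexId's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical substring renamer: only whole substrings of an identifier are replaced. */
final class SubstringRenamer {
    // Zero-width boundaries: around "_" and "$", between letters and digits,
    // and where a lowercase letter is followed by an uppercase letter.
    private static final String BOUNDARY =
            "(?<=[_$])|(?=[_$])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])|(?<=[a-z])(?=[A-Z])";

    /** Replaces each substring of the identifier that equals "from" with "to". */
    static String renamePart(String identifier, String from, String to) {
        StringBuilder out = new StringBuilder();
        for (String piece : identifier.split(BOUNDARY)) {
            out.append(piece.equals(from) ? to : piece);
        }
        return out.toString();
    }

    /** Applies the same substring rename to every identifier instance in a pasted fragment. */
    static List<String> renameInFragment(List<String> identifiers, String from, String to) {
        List<String> renamed = new ArrayList<>();
        for (String identifier : identifiers) {
            renamed.add(renamePart(identifier, from, to));
        }
        return renamed;
    }

    public static void main(String[] args) {
        List<String> pasted = java.util.Arrays.asList("a1", "a2", "b1", "b2", "array");
        System.out.println(renameInFragment(pasted, "a", "x"));
    }
}
```

Note that “array” is left untouched when “a” is renamed to “x”, because “a” is not a whole substring of that identifier; a text-based replacement would have changed it.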
Figure 12: LexId renames a substring “b” to “y” consistently in pasted code.
Inferring Patterns in Tokens
In addition to inferring patterns in identifiers, which is also shown on the right
side of the equals sign in Figure 13, LexId could also include support for tokens. In the
AST, identifiers are of type SimpleName and are leaf nodes, while tokens are of type
NumberLiteral and are also leaf nodes. Tokens are often used in initializations (for
example, in the code “int i = 1;” the 1 is a token). Figure 13 shows a use of tokens for
array access (on the left of the equals sign). Another feature addition to LexId could be to
infer patterns, such as incrementing numbers, similar to how pulling down a column in a
spreadsheet program can increment the number on the next line. The programmer might
have copied and pasted each line of code only to change the number value on each line.
With LexId, the programmer can specify the desire to auto-increment by pulling a code
fragment with the mouse cursor, and the tool would begin duplicating the lines, only
modifying the number each time. The tool could treat these new lines of code as “copied
and pasted” and start tracking them as clones.
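A rough sketch of such an auto-increment feature is given below. It is purely illustrative of the proposed (not yet existing) feature: the class AutoIncrement is hypothetical, and it makes the simplifying assumption that every run of digits in the seed line, including digits at the end of identifiers, is incremented by one in each generated line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical auto-increment helper, similar to filling down a column in a spreadsheet. */
final class AutoIncrement {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Returns the line with every run of digits replaced by its value plus one. */
    static String incrementNumbers(String line) {
        Matcher matcher = NUMBER.matcher(line);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int next = Integer.parseInt(matcher.group()) + 1;
            matcher.appendReplacement(out, String.valueOf(next));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Generates the requested number of follow-up lines, each incremented from the previous one. */
    static List<String> duplicate(String seedLine, int copies) {
        List<String> lines = new ArrayList<>();
        String current = seedLine;
        for (int k = 0; k < copies; k++) {
            current = incrementNumbers(current);
            lines.add(current);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : duplicate("values[0] = item0;", 3)) {
            System.out.println(line);
        }
    }
}
```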
Figure 13: A new feature of LexId can be support for auto-incrementing tokens
(left) as well as lexical patterns in identifiers (right).
Inferring Patterns in Types and Subtypes
Another kind of inconsistency that LexId can be made to help prevent is type
inconsistencies in copy-and-pasted code. When code is copied and pasted, there might be
a certain type or subtype that can be inferred at a certain position in the pasted code based
on the type that is at that position in the original code. For example, in the code of the
Structural Constraint Language (SCL) plug-in, many subclasses were created from an
existing superclass by copying and pasting it and then modifying the subclasses. One
commonality among all of the subclasses is that they all have a method called
copyExpr(), which is shown in Figure 14. The “XXX” represents the name of the current
subclass and “arg” is a field of the superclass (“YYY”). When the pasted class’ name is
being modified, LexId can infer that the constructor that is being called in the copyExpr()
method should also be the same name. In this case, LexId could just give a warning to the
programmer about the inferred type inconsistency instead of doing the renaming
automatically (error detection instead of error prevention). [The current default of LexId
would perform the renaming based on the common identifier name.] The compiler would
not warn the programmer about this kind of inconsistency, since the compiler does not
infer the programmer’s intention and the code is otherwise correct (for example, if the
constructor is defined and accessible).
Figure 14: LexId can be made to infer that the constructor that is called within a
common method should be the same as the current subclass’ name (“XXX”).
Inferring the Programmer’s Intention
In order to work automatically, CnP infers the programmer’s intention such that it
assumes that the programmer is copying and pasting in order to create intentional clones
and that the programmer wishes to modify the identifiers within a clone consistently
(which is done with CReN and LexId). Though the programmer can change this default
behavior by removing any clone or removing any identifier instance from being tracked,
CnP must infer the most common editing scenarios to operate without interruption.
Instead of assuming the programmer’s reasons for copying and pasting, a new
tool named CloneBoard* [46, 47] intercepts the copy and paste operations and directly
asks programmers why they are cutting, copying, or pasting. Previous research suggested
replacing the standard cut, copy, and paste operations with the four choices of move,
copy-identical, copy-and-change, and copy-once, but CloneBoard instead gives the
programmer a list of “clone change resolution strategies” to choose from, such as: to
parameterize the clone, to unmark the clone’s tail (remove tokens from the cloning
relation at the end of the fragment), to unmark the clone’s head (remove tokens from the
cloning relation at the start of the fragment), to postpone resolution, to unmark the clone,
to apply changes to all clones, and to ignore changes [47].
* http://swerl.tudelft.nl/bin/view/Main/CloneBoard
Other research in the area of intent inference includes tools such as PR-Miner
[173], Prospector [181], and Strathcona [87, 92]*. Some research [155, 156] utilizes
semantic information to tell the intention of the program. Still, many continue to rely on
the help of the programmer to specify his or her intentions [169]. The concepts of
intentional programming [29], which enables source code to reflect the intentions that
programmers have in mind when conceiving their work, were implemented in an IP
(intentional programming) IDE, which allows domain experts and programmers to work
together to describe the program’s intended behavior in a “what you see is what you get”
(WYSIWYG) way [222].
* http://opera.cs.uiuc.edu/Projects/ARTS/PR-Miner.htm, http://www.cs.berkeley.edu/~mandelin/, http://lsmr.cs.ucalgary.ca/projects/strathcona/
Whether a tool assumes the programmer’s intention or explicitly asks for it,
the tool should always have some support for user control.
Identifier Names and Identifier Tracking Tools
A variety of research looks into the topic of identifier names [4, 34, 35, 36, 48,
49, 62, 163, 164, 165, 166, 167, 168], emphasizing the importance of naming
choice in programming. It is argued that more concise and consistent naming (for
example, following a particular programming style or naming convention) can help avoid
inconsistencies in naming, making the source code more readable and understandable.
Furthermore, the improved identifier naming and better program comprehension can
“increase productivity and quality during software maintenance and evolution” [48, 49].
When applied to clones, CReN and LexId track the relationships of the identifier
instances between the related clones and the relationships between the identifier instances
of the same program element or name within each clone. A related identifier tracking
tool, Vaci*, detects contexts and identifiers, and then forms translation classes that can be
used to detect an inconsistency in naming [119]. Recall that CP-Miner and the
DECKARD-based tool (mentioned in Section 2.2.2.3 – Clone Visualization, since they
both show warnings to the programmer) use identifier mapping and identifier counting,
respectively, to detect inconsistencies in identifiers between copied and pasted code.
Though these two tools also use methods to detect identifier inconsistencies, they do not
track identifiers over software versions like Vaci does. While all of these tools (CP-
Miner, DECKARD-based tool, and Vaci) can detect possible renaming inconsistencies,
CReN and LexId instead proactively prevent the inconsistencies from occurring at all.
Renaming Tools
The Find & Replace, Refactoring (Rename), Linked Renaming*, and Rename
Type Refactoring* features in Eclipse can assist a programmer with consistent renaming
in the IDE. Each has its own set of limitations and differences from CReN and LexId.
* http://www.ccfinder.net/vaci.html
* Scroll down to the heading “Quick Assist” to read about Eclipse’s “Linked Rename” feature - http://archive.eclipse.org/eclipse/downloads/drops/R-2.1-200303272130/whats-new-jdt-editor.html
* Scroll down to the heading “Rename Type …” - http://archive.eclipse.org/eclipse/downloads/drops/R-3.2-200606291905/new_noteworthy/eclipse-news-all.html#JDT
Find & Replace in Eclipse allows the programmer to find specified text and
replace it with another text (Figure 15). Find & Replace is simply a text-based search and
has no knowledge of the structure of the program. It does not infer intent and must be
initially requested by the programmer. In addition, Find & Replace is not limited to
within a clone code fragment, so the programmer must know where renaming in the
clone begins and ends and manually replace only those instances.
Figure 15: Find & Replace can rename all instances of “i” (as a whole word) to “j”
in the selected lines, but this needs to be specified by the programmer and is simply
a text-based search.
The Rename refactoring allows the programmer to rename various program
elements. As such, binding is an important condition for it to work, which is not
necessary for CReN, as in Figures 16 and 17. Furthermore, Rename is automatically
applied to the whole project instead of a clone, as in Figure 18.
Figure 16: Rename Refactoring does not work with code that does not type check
(binding is required for it to work).
Figure 17: CReN works with code that does not type check (binding is
not required for it to work).
Figure 18: Rename Refactoring is not limited to renaming within a clone (for
example, only in the pasted for loop).
A case where CReN works and existing refactoring support does not is when the
programmer would like to switch the “i”s and “j”s between nested for loops. Consider
that the declarations of “i” and “j” are external to the loops and the outer loop contains
the “i”, while the inner loop contains the “j” initially. By using Rename Refactoring in
Eclipse, the programmer can first rename the “i”s to “j”s in the outer loop and its
declaration together. However, when the programmer wishes to change only the “j”s in
the inner loop, Rename Refactoring will change all instances of “j” to “i”, including those
in the outer loop. This is shown in the top picture in Figure 19. On the other hand,
performing the same sequence of steps with CReN on the entire “user-specified code
fragment” that includes the declarations gives the desired result, shown on the bottom of
Figure 19. Only the “j”s for the inner loop are changed to “i”.
Figure 19: Refactoring (top) vs. CReN (bottom).
Linked Renaming allows the programmer to rename identifiers within a file
scope, while CReN can be applied across multiple files, shown in Figure 20.
Furthermore, Linked Renaming neither works with code that does not parse (Figures 21
and 22) nor renames identifiers only within a clone (Figure 23) as CReN does.
Figure 20: CReN works across multiple files (file 1 is on top, file 2 is on the bottom).
Figure 21: Linked Renaming does not work with code that does not parse (notice the
added semi-colon between the ++ on line 33).
Eclipse’s Rename Type Refactoring feature allows similarly named variables and
methods within a class to be updated when the class name is renamed. For example,
when a class “Bar” is renamed to class “Foo”, the “fBar” variable and “createBar”
method (both of type “Bar”) within the class are renamed to “fFoo” and “createFoo” if
the programmer checks this option when invoking refactoring in Eclipse. LexId can do
this same thing, but does not limit itself to the class scope.
Figure 22: CReN works with code that does not parse (notice
the added semi-colon between the ++ on line 33).
Figure 23: Linked Renaming is not limited to renaming within a
clone (for example, only in the pasted for loop).
2.2.3.4. Clone Extinction
Refactoring
    Refactoring improves the design. What is the
    business case of good design? To me, it’s that you
    can make changes to the software more easily in
    the future.
    - Martin Fowler
Refactoring is actually a form of editing, for example, to make the multiple,
related clones into a single procedure. The process of refactoring itself can be difficult for
the programmer and potentially error-prone. Refactoring is a stage of the clone lifecycle
after which the clones will be removed from the code base (clone extinction). Refactoring
is not just deleting the clones, though; instead, it changes the form of the clones that
are in the same clone group (that is, related clones) into the same abstraction, with
differences between the clones reflected in the function’s parameters, for example. The
functionality of the clones is still there; refactoring just attempts to solve the maintenance
problem of clones by having updates made in a single location (for example, in the
function’s body) rather than separately in each clone. Essentially, converting code clones
into abstractions is a way of maintaining the unity of the clones.
Though many people propose refactoring out clones as part of “clone detection
and removal”, only a small amount of clone detection research has focused on the
removal part and there are relatively few tools that specifically support clone refactoring
(Section 2.1). In fact, all six clone tracking tools discussed in this chapter that have an
editing focus, including CnP, do not explicitly have clone refactoring support, though
they could. The authors of Codelink mentioned designing a feature to allow programmers
to move back and forth between clone and abstraction representations, but this was not
yet implemented [231]. There is definitely a need to help programmers with the editing
and timing of the refactoring (to make sure that it is not done prematurely and to make
sure that refactoring is the appropriate programming decision) and it should especially
become a feature of tools that aim to support the full clone lifecycle.
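As an illustration of turning related clones into one abstraction, the two loops of the range example from Figure 8 could be refactored as sketched below. The names ExtremeFinder, extreme, and range are hypothetical; the only difference between the clones (the comparison direction) becomes a parameter of the shared method.

```java
import java.util.function.IntBinaryOperator;

/** Hypothetical refactoring of the "low" and "high" loops into a single abstraction. */
final class ExtremeFinder {

    /** One loop replaces both clones; "pick" chooses between the best value so far and the next one. */
    static int extreme(int[] array, IntBinaryOperator pick) {
        int best = array[0];
        for (int k = 1; k < array.length; k++) {
            best = pick.applyAsInt(best, array[k]);
        }
        return best;
    }

    /** The range is the highest element minus the lowest element. */
    static int range(int[] array) {
        return extreme(array, Math::max) - extreme(array, Math::min);
    }

    public static void main(String[] args) {
        System.out.println(range(new int[] {3, 9, 1, 4}));
    }
}
```

After this change, an update such as a bounds check is made once, in the body of extreme, rather than separately in each former clone.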
Clone Divergence (Loss of Similarity)
Refactoring is not the only method of clone removal, however. Another way in
which clones can become extinct is by clone divergence. With a lot of editing over time,
clones may naturally diverge or separate from one another such that they do not have a
significant amount of similarity remaining between them anymore. Unlike refactoring,
which keeps related clones unified, clone divergence actually removes the cloning
relationship that the clones once shared. Though refactoring may at times be forced,
clone divergence is more likely to happen naturally if it happens at all. Clones that were
made via copying and pasting are likely to continue sharing some level of similarity or
else the original code would not have been duplicated to begin with.
Almost all of the clone tracking tools support clone divergence in some way. CnP
allows the programmer to remove a clone from the clone group. In this way, the
programmer can have full control over the clones that are considered related (or similar)
to one another. Future support includes the possibility of merging clone neighbors into a
single code region. Clonescape and CPC offer similar functionality for removing a clone
from a clone group. CPC also has the notion of an orphan clone, which is a clone where
all of its relatives in the clone group, including its origin clone, were deleted. This
remaining code fragment would not really be considered a clone anymore, since it does
not belong to any group, but CPC chooses to acknowledge that it once was a clone.
Codelink and CloneTracker both allow clones to be “linked” and “unlinked”.
2.3. Prevalence of Clones, Renaming, and Related Errors in Production Code
Prevalence of Clones in Production Code
Research has shown that there is significant code reuse in both commercial and
open source software [59, 224, 225]. Just about all clone-related papers examine the
relevance of their work by showing that there are, in fact, clones in existing software.
Many times researchers run their own clone detection tool in order to test it and verify the
results. Other times a variety of clone detection tools are run for comparison of their
results. Techniques have also been developed to make the analysis of large result sets easier [126].
There is no doubt that clones do exist.
Previous case studies [96] done with the CCFinderX and SimScan clone detection
tools on open source software including the Structural Constraint Language (SCL) plug-
in and the Eclipse JDT UI plug-in, showed that clones do exist in real-world systems and
thus there is a need for proactive clone management. For SCL, SimScan found 102 clone
groups, 70 of which were considered intentional, useful clone groups (and not generated
parser code or accidental clones). Guessing which clones were likely to have been copied
and pasted (indicated by the presence of the same special comments or by their close
proximity to each other), it was determined that approximately 50 out of the 70
intentional, useful clone groups could have been supported by proactive clone
management. A larger number of clone groups were detected by CCFinderX and
SimScan on the Eclipse JDT UI source code. But after inspecting about 200 clone groups
of class-level clones, it was concluded that a proactive clone management system would
be useful and needed in this case as well.
Prevalence of Renaming in Production Code
Just as it is difficult to tell if something was copied and pasted, it can also be
hard to tell retroactively whether something was copied and pasted and then renamed (or
whether the code fragment was named differently to begin with). The study in [171] suggests
that most (65-67%) copied-and-pasted code fragments require renaming at least one
identifier. So, if something is known to have been copied and pasted from another code
fragment (meaning that the code fragments were once identical), then any name changes
made will constitute a renaming. Therefore, one way to guess that renaming has occurred
is to look at the same piece of source code over time. That way the source code’s original
names are known along with its later names, such that if they become different, then a
renaming can be assumed to have happened. Both Clever [197] and Vaci [119] determine
the correspondence between identifiers over software versions (in version control
systems) and can be used to determine renaming inconsistencies in this way. Other tools
(in the next subsection) try to detect identifier inconsistencies between clones in a single
version of the source code, which assumes that the clones were copied and pasted and
then modified.
Prevalence of Clone-Related Errors in Production Code
An analysis of commercial and open source systems found that inconsistent
changes are made to clones very frequently and many lead to unexpected behavior or
faults [117]. Though there are various kinds of clone-related errors, CReN and LexId’s
main focus is on identifier renaming inconsistencies, which will be elaborated on here.
There are examples from literature that show an inconsistent renaming of
identifiers within a copy-and-pasted clone in production code. Three examples are shown
in Table 4.
The first example in Table 4, published in [171], is from the file memory.c in
Linux version 2.6.6, which deals with programmable read-only memory (PROM). The
original code fragment (on the left) is a for loop that is copied and pasted and then
modified. In the modified pasted code fragment (on the right), the programmer intended
to change all instances of the array name “prom_phys_total” to “prom_prom_taken”. The
programmer unintentionally did not change one instance of the array’s name (in the last
line). The compiler did not detect this error because “prom_phys_total” is still in scope.
In this example, the for loop was copied and pasted within the same function: void __init
prom_meminit(void), which begins at line 68 in memory.c (not shown).
The second example that is shown in Table 4, from [174], is code that is part of
the GNU command “bc” (which is a binary calculator that can be run on the command
line), in the file storage.c. The original copied code fragment (on the left) is a function
named “more_variables” that allocates a larger amount of memory for the “variables”
array. It then copies the values over from the smaller array “old_var” to the larger array
“variables” (in the first for loop), and then fills in the rest of the space in the “variables”
array with NULL. In the modified pasted code fragment (on the right), the function’s
name was renamed from “more_variables” to “more_arrays”, the type “bc_var” was
renamed to “bc_var_array”, and all instances of the arrays “old_var”, “variables”, and
“v_names” were renamed to “old_ary”, “arrays”, and “a_names”, respectively. However,
one instance of the variable “v_count” in this function was missed and not renamed to
“a_count” (in the second for loop’s condition). Because “v_count” is defined as a global
variable, this copy-paste error was not detected by the compiler.
Table 4’s third example is from [111] and is code in the file dependency.c from
the GCC Fortran compiler. In this example, the identifier “l_stride” in the if statement’s
condition is also used in the if statement’s body. However, in the modified code
fragment, the “r_stride” identifier was supposed to be left as “l_stride”. This is a different
type of error than the other two, but is still an inconsistency in renaming that was not
caught by the compiler or the programmer during development.
All of the examples presented here contain inconsistent renaming errors that were
found in existing production source code (there are also many more examples of this in
practice). The hope is that the CnP tool would prevent this type of error from occurring at
all, by catching it during program development. This should be more cost effective than
detecting and fixing inconsistent renaming errors after they have happened. Existing tools
typically involve computationally expensive, sophisticated algorithms, like statistical bug
isolation [174], or running a clone detection tool followed by a number of error detection
and pruning algorithms, which still results in many false positives [111, 171]. However,
the proactive approach of clone-related error prevention does not replace existing error
detection tools. It complements them, since they are still needed to find potential errors
in legacy code.
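The kind of inconsistency described in the examples above can be made concrete with a minimal Java sketch. It compares the identifiers of an original fragment and its pasted copy position by position and reports any original identifier that corresponds to more than one name in the copy. This is an illustration only, not the algorithm used by CnP or by the tools in [111, 171, 174]; the class name RenameConsistencyCheck and its methods are hypothetical, and the sketch assumes that both fragments contain the same number of identifiers in the same order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CnP: flags partially renamed identifiers.
class RenameConsistencyCheck {
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Extracts identifier-like tokens in order of appearance (keywords included).
    public static List<String> identifiers(String code) {
        List<String> result = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(code);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Maps each original identifier to the names found at the same positions
    // in the pasted fragment, keeping only those with two or more names.
    public static Map<String, Set<String>> inconsistentRenames(String original, String pasted) {
        List<String> a = identifiers(original);
        List<String> b = identifiers(pasted);
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("fragments differ in structure");
        }
        Map<String, Set<String>> mapping = new LinkedHashMap<>();
        for (int i = 0; i < a.size(); i++) {
            mapping.computeIfAbsent(a.get(i), k -> new LinkedHashSet<>()).add(b.get(i));
        }
        mapping.values().removeIf(names -> names.size() < 2);
        return mapping;
    }

    public static void main(String[] args) {
        // Last statement of the first example in Table 4.
        String original = "prom_phys_total[iter].theres_more = &prom_phys_total[iter+1];";
        String pasted = "prom_prom_taken[iter].theres_more = &prom_phys_total[iter+1];";
        // prints {prom_phys_total=[prom_prom_taken, prom_phys_total]}
        System.out.println(inconsistentRenames(original, pasted));
    }
}
```

A check of this kind runs after the fact and can raise false alarms when a partial renaming is intentional, which is one reason why preventing the inconsistency at editing time is attractive.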
Table 4: Three examples from literature that show an inconsistent renaming of
identifiers in the pasted code fragment.
Example 1. File: linux-2.6.6/arch/sparc64/prom/memory.c
The Original Copied Code Fragment (lines 92-99):
    for(iter=0; iter<num_regs; iter++){
        prom_phys_total[iter].start_adr =
            prom_reg_memlist[iter].phys_addr;
        prom_phys_total[iter].num_bytes =
            prom_reg_memlist[iter].reg_size;
        prom_phys_total[iter].theres_more =
            &prom_phys_total[iter+1];
    }
The Modified Pasted Code Fragment (Buggy) (lines 111-118):
    for(iter=0; iter<num_regs; iter++){
        prom_prom_taken[iter].start_adr =
            prom_reg_memlist[iter].phys_addr;
        prom_prom_taken[iter].num_bytes =
            prom_reg_memlist[iter].reg_size;
        prom_prom_taken[iter].theres_more =
            &prom_phys_total[iter+1]; //error
    }
Example 2. File: bc-1.06/bc/storage.c
The Original Copied Code Fragment (lines 118-150):
    void
    more_variables ()
    {
      int indx;
      int old_count;
      bc_var **old_var;
      char **old_names;
      /* Save the old values. */
      old_count = v_count;
      old_var = variables;
      old_names = v_names;
      /* Increment by a fixed amount and allocat...
      v_count += STORE_INCR;
      variables = (bc_var **) bc_malloc (v_count...
      v_names = (char **) bc_malloc (v_count*siz...
      /* Copy the old variables. */
      for (indx = 3; indx < old_count; indx++)
        variables[indx] = old_var[indx];
      /* Initialize the new elements. */
      for (; indx < v_count; indx++)
        variables[indx] = NULL;
      ...
    }
The Modified Pasted Code Fragment (Buggy) (lines 152-185):
    void
    more_arrays ()
    {
      int indx;
      int old_count;
      bc_var_array **old_ary;
      char **old_names;
      /* Save the old values. */
      old_count = a_count;
      old_ary = arrays;
      old_names = a_names;
      /* Increment by a fixed amount and allocat...
      a_count += STORE_INCR;
      arrays = (bc_var_array **) bc_malloc (a_co...
      a_names = (char **) bc_malloc (a_count*siz...
      /* Copy the old arrays. */
      for (indx = 1; indx < old_count; indx++)
        arrays[indx] = old_ary[indx];
      /* Initialize the new elements. */
      for (; indx < v_count; indx++) //error
        arrays[indx] = NULL;
      ...
    }
Example 3. File: gcc-4.0.1/gcc/fortran/dependency.c
The Original Copied Code Fragment (lines 414-415):
    if (l_stride != NULL)
      mpz_cdiv_q (X1, X1, l_stride->value.integer);
The Modified Pasted Code Fragment (Buggy) (lines 422-423):
    if (l_stride != NULL)
      mpz_cdiv_q (X2, X2, r_stride->val... //error
Chapter 3
Methodology
Test everything. Hold on to that which is good.
- 1 Thessalonians 5:21
After implementing the CReN and LexId parts of CnP and improving CnP’s clone
visualization feature according to the clone property definitions and stages of clone
lifecycle support described in the previous chapter, a user study was conducted to test a
series of hypotheses related to these features.
3.1. User Study on CnP’s Visualization, CReN, and LexId
The user study was designed to test three aspects of the CnP Eclipse plug-in that
involve user interaction – CnP’s clone visualization, CReN’s consistent identifier
renaming, and LexId’s consistent substring renaming – with eight programming tasks
across three task categories, shown in Table 5.
Table 5: High-level description of the tasks in the user study.
3.1.1. User Study Hypotheses
Based on inferences about the tool’s expected behavior, a set of hypotheses was
developed for each software feature to test with human subjects in a controlled setting.
CnP Clone Visualization Hypotheses:
1. CnP’s clone visualization makes it faster for programmers to find software bugs
in copied-and-pasted code than debugging manually or with other tools, when the
cloning information is not fresh in their memories.
2. CnP’s clone visualization makes it faster and less error-prone for programmers to
make modifications to copied-and-pasted code than modifying without
visualization, when the cloning information is not fresh in their memories.
CReN Identifier Renaming Hypotheses:
3. Using CReN to rename identifier instances consistently in copied-and-pasted code
is quicker than performing the same task manually or with other tools.
4. CReN prevents inconsistent renaming errors that can otherwise occur.
LexId Substring Renaming Hypotheses:
5. Using LexId to rename substring instances consistently in copied-and-pasted code
is quicker than performing the same task manually or with other tools.
6. LexId prevents inconsistent renaming errors that can otherwise occur.
Eagleson’s law: Any code of your own that you
haven’t looked at for six or more months might as
well have been written by someone else.
In order to validate these claims, a set of tasks was created in three main areas of
programming – debugging (for CnP’s clone visualization), modification (also for CnP’s
clone visualization), and renaming (for CReN and LexId). This resulted in eight tasks
total, two for each task category or tool feature (Table 5).
3.1.2. Subject Characteristics
After Clarkson University Institutional Review Board (IRB) approval, a
recruitment email (in Appendix A) was sent to all Clarkson University Math & Computer
Science and Electrical & Computer Engineering undergraduate and graduate students via
their department mailing lists. Participants in the user study were required to be able to
read and write simple Java programs and to have experience with Swing (graphical user
interface programming). Familiarity with integrated development environments (IDEs),
especially Eclipse, was preferred, but not necessary.
Fourteen subjects participated in the user study: eight undergraduate students and
six graduate students. All subjects were male.
Based on the answers given on a user experience questionnaire (in Appendix C), seven
subjects had a long-time knowledge of both Java and Swing, three subjects had a long-
time knowledge of Java but only recent knowledge of Swing, and the remaining four
subjects reported having relatively recent knowledge of both Java and Swing (having
learned them for the first time within the past two years).
Seven subjects said that they have written at least 10,000 lines of Java code since
they have known the language, while the other seven gave a lower estimate when
describing their Java programming experience. Some subjects mentioned that they have
worked on medium to large-sized software projects, but this was not quantifiable. Ten
of the fourteen subjects said that the software they had written in Java was for
courses only, while the other four subjects had worked on software projects in industry.
As a result, subjects who had worked on projects outside of the classroom considered
themselves to be “very experienced”, while those who were relatively new to Java and
Swing or had only used them for courses claimed to be knowledgeable about Java and
Swing, but admitted not being regular users of them.
3.1.3. Study Procedure
Each subject was invited, one at a time, into a user study laboratory for a session
that lasted between one and two hours. After signing an informed consent form (in
Appendix B), background information was presented to the subject about what code
clones are and problems that can result from copy-and-paste programming. The subjects
were then given a short introduction to the software tool’s three features, explaining how
they address the mentioned copy-paste-modify issues, and they were told about the other
tools that they could use during the experiment when they were doing a task without CnP.
Following this part of the introduction material, the subjects were shown the
software that they would be working with during the user study, which had been set up
within the Eclipse IDE (Ganymede version 3.4.2 build M20090211-1700) on Windows
XP. The publicly available Paint program from Carnegie Mellon University [16] was
adapted for use in the user study, which is shown with its identifier names labeled in
Figure 24. The Paint program was a good choice for the user study, since GUI software
often contains many common code fragments that are intentionally copied and pasted and
not abstracted away into procedures [210]. The program was run to show them what it
looks like graphically, and then the major parts of the source code were shown to them so
that they would have some basic familiarity with it before starting the programming
tasks. The subjects were told that all tasks in the user study would be performed in one
file (PaintWindow.java, which is 264 lines of code).
The presentation was then concluded by reminding the subjects that they would
complete eight programming tasks (four for clone visualization, two for CReN, and two
for LexId), followed by a user experience questionnaire (in Appendix C). They were
asked to talk aloud while completing each task, so that their thoughts and actions could
be better understood for analysis, and they had to announce when they were finished with
the task. Each task had a time limit that the subject was told about in advance (Table 5).
As an incentive to work as efficiently as possible, everyone was told that extra
compensation would be provided to the four subjects who completed the eight tasks with
the best accuracy and speed. Finally, the subjects were told that their session was being
recorded with TechSmith’s Morae software (version 2.0.1), which captured the activity
on the computer screen and recorded the user with video and audio.
Figure 24: The CMU Paint program used in the user study with widgets annotated by corresponding instance variables.
3.1.4. Task Descriptions
Before the subject was asked to start a task, the current task’s description was
read to him, which explained the problem that he was to solve with a screenshot of what
the desired solution should look like. The subject was able to keep this instruction sheet
and a sheet that contained the identifier names used in the program to look at as a
reference. Subjects were also allowed to use an online Swing tutorial that had been
opened in the browser and to ask any clarification questions before beginning. The
version of the Paint program specific to each task was run so that the subjects could see
the task’s problem visually. They were told whether the current task involved CnP, and if
not, they were reminded of some other tools that they could use.
Each pair of tasks (Tasks 1 & 2, Tasks 3 & 4, Tasks 5 & 6, and Tasks 7 & 8) was
designed to be very similar in terms of difficulty and effort, yet different enough that
there would not be a learning effect. The subjects completed each of the eight tasks once,
alternating task completion with and without the CnP software tool. Of the fourteen
subjects who participated in the user study, seven completed a certain task with the tool,
while the other seven participants completed that same task without the tool present.
(Note: The subject sample size was fourteen, not seven. Similar tasks were paired
together, rather than subjects, to avoid introducing differences between subjects as a
variable in the study. Each subject’s performance on one task of a pair, with the tool
present, was compared with his own performance on the other task of the pair, without
the tool available.)
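The counterbalanced design can be summarized in a short Java sketch. It assumes subjects numbered 1 to 14 and the odd/even assignment described for the individual tasks (odd numbered subjects use the tool on odd numbered tasks, even numbered subjects on even numbered tasks); the class and method names are hypothetical.

```java
// Hypothetical sketch of the with-tool / without-tool assignment.
class TaskAssignment {
    static final int SUBJECTS = 14;
    static final int TASKS = 8;

    // True when the subject performs the task with the tool
    // (CnP visualization, CReN, or LexId, depending on the task).
    public static boolean usesTool(int subject, int task) {
        return subject % 2 == task % 2;
    }

    // Number of subjects who perform the given task with the tool.
    public static int subjectsWithTool(int task) {
        int count = 0;
        for (int subject = 1; subject <= SUBJECTS; subject++) {
            if (usesTool(subject, task)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int task = 1; task <= TASKS; task++) {
            System.out.println("Task " + task + ": " + subjectsWithTool(task)
                    + " with tool, " + (SUBJECTS - subjectsWithTool(task)) + " without");
        }
    }
}
```

Under this assignment every task is completed by seven subjects with the tool and seven without, and every subject has exactly one with-tool task in each pair.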
3.1.4.1. Debugging and Modifying within a Clone
The first four tasks were debugging and modification tasks, which had a time
limit of ten minutes each. The debugging process involved finding an existing bug in the
source code and fixing it. The modification process included adding a new feature to a
working version of the software, for example, inserting new code for titled borders or for
labels’ text colors. Modification required more than just renaming identifier names.
Debugging Tasks
If debugging is the process of removing bugs, then programming must be the process of
putting them in.
- Edsger Dijkstra
The Task 1 and Task 2 pair consisted of debugging tasks that tested Hypothesis 1 (listed
in Section 3.1.1). The subjects were told that the bug for each task was in cloned (copied
and pasted) code. Odd numbered subjects were asked to perform Task 1 with CnP
support and Task 2 without it, and even numbered subjects did the reverse.
For the subjects who had CnP support for a task, the clone groups were already
highlighted with different colors in the PaintWindow.java file. It was explained to them
that those colored code fragments had been copied and pasted before. The following five
clone groups were specified in PaintWindow (Figure 24):
• Group 1. The red (r) slider/panel, the green (g) slider/panel, the blue (b)
slider/panel, and the thickness (t) slider/panel.
• Group 2. The color panel and the thickness panel.
• Group 3. The tool panel and the clear/undo panel.
• Group 4. The UI constraints for each panel: the tool panel, the color panel, the
thickness panel, and the clear/undo panel.
• Group 5. The declaration of the thickness change listener and the declaration of
the color change listener.
Subjects who did not have CnP’s clone visualization for a task were given the
option to use the clone detection tool CCFinderX (version 10.2.7.1). CCFinderX was the
clone detection tool of choice because its algorithm is token-based, which is known to
have high recall. CCFinderX also has a graphical interface, is available for free
download, and is easy to install and use. It was pre-installed on the computer for the user
study, the subjects were shown how to use it, and they were provided with written
instructions that they could refer back to.
The problem that the subjects were given in Task 1 was that “moving the blue
slider does not change the pixel color”. The program was run to show them that
this was the case, and they were shown that the other color sliders (the red and green
sliders) did work correctly. The subjects were then told to find the bug in the source
code and fix it so that the blue slider changes the color correctly.
The bug was that there was an instance of the identifier rSlider that appeared in
the blue slider/panel clone that was supposed to be bSlider (on line 120, shown in Figure
25). The blue slider clone could plausibly have been copied and pasted from the existing
red slider clone and then modified. Besides this bug in the program’s functionality, there
were no syntax errors reported by the compiler, since the slider and panel variables were
all defined at the class level as instance variables. This kind of error can happen in
practice and it would take extra time to search the code, fix it, and then test.
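A minimal Java sketch of this kind of bug follows. It is not the code of PaintWindow.java: the class SliderCloneSketch, the body of the listener, and the slider ranges are hypothetical, and since the statement on line 120 is not reproduced in the text, the sketch assumes that the stray rSlider occurred in the listener registration. The point is only that the code compiles because all sliders are instance variables.

```java
import java.awt.Color;
import javax.swing.JSlider;
import javax.swing.event.ChangeListener;

// Hypothetical reconstruction of the Task 1 situation, not the study's source code.
class SliderCloneSketch {
    // Instance variables are visible in every clone, so a stray name still compiles.
    final JSlider rSlider = new JSlider(0, 255, 0);
    final JSlider gSlider = new JSlider(0, 255, 0);
    final JSlider bSlider = new JSlider(0, 255, 0);
    Color pixelColor = Color.BLACK;

    // Shared listener (assumed body): recompute the color from all three sliders.
    final ChangeListener colorChangeListener = e ->
            pixelColor = new Color(rSlider.getValue(), gSlider.getValue(), bSlider.getValue());

    SliderCloneSketch(boolean fixed) {
        // Red slider clone (the copied original).
        rSlider.addChangeListener(colorChangeListener);
        // Green slider clone (renamed consistently).
        gSlider.addChangeListener(colorChangeListener);
        // Blue slider clone.
        if (fixed) {
            bSlider.addChangeListener(colorChangeListener);
        } else {
            rSlider.addChangeListener(colorChangeListener); // bug: should be bSlider
        }
    }

    public static void main(String[] args) {
        SliderCloneSketch buggy = new SliderCloneSketch(false);
        buggy.bSlider.setValue(200);
        System.out.println("buggy blue = " + buggy.pixelColor.getBlue()); // prints 0

        SliderCloneSketch fixed = new SliderCloneSketch(true);
        fixed.bSlider.setValue(200);
        System.out.println("fixed blue = " + fixed.pixelColor.getBlue()); // prints 200
    }
}
```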
Figure 25: Task 1 – rSlider should be bSlider (on line 120).
Task 2’s problem stated that “moving the thickness slider does not change the
pixel thickness”. The subjects were shown this behavior by running the program. They
were shown that the color sliders all work in this task and they were told to find the bug
in the code and fix it so that the thickness slider changes the pixel thickness correctly
when moved.
The correct solution to Task 2 was to change the identifier instance of
colorChangeListener in the thickness slider clone to thicknessChangeListener (on line
142, shown in Figure 26). Eclipse showed warning icons that said that “the field
PaintWindow.thicknessChangeListener is never read locally” at the listener’s declaration,
but it was not evident that any subject noticed this warning, since they
went straight to the section of the file that contains the slider clones. The bug in this task
could have happened when the tSlider clone was created by copying and pasting one of
the color slider clones. In this scenario, the programmer may have focused on renaming
the identifiers to tSlider and tPanel, overlooking the single change of renaming
colorChangeListener to thicknessChangeListener.
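The Task 2 situation can be sketched in Java in the same spirit. The class ThicknessListenerSketch, the listener bodies, and the slider range are hypothetical; only the identifiers tSlider, colorChangeListener, and thicknessChangeListener come from the text. In the buggy variant thicknessChangeListener is never registered, which corresponds to the unused-field warning mentioned above.

```java
import javax.swing.JSlider;
import javax.swing.event.ChangeListener;

// Hypothetical reconstruction of the Task 2 situation, not the study's source code.
class ThicknessListenerSketch {
    final JSlider tSlider = new JSlider(1, 20, 1); // assumed range
    int pixelThickness = 1;
    int colorUpdates = 0;

    // Assumed listener bodies: one counts color updates, one stores the thickness.
    final ChangeListener colorChangeListener = e -> colorUpdates++;
    final ChangeListener thicknessChangeListener = e -> pixelThickness = tSlider.getValue();

    ThicknessListenerSketch(boolean fixed) {
        if (fixed) {
            tSlider.addChangeListener(thicknessChangeListener);
        } else {
            // Left over from the copied color slider clone: compiles, but wrong.
            tSlider.addChangeListener(colorChangeListener);
        }
    }

    public static void main(String[] args) {
        ThicknessListenerSketch buggy = new ThicknessListenerSketch(false);
        buggy.tSlider.setValue(8);
        System.out.println("buggy thickness = " + buggy.pixelThickness); // prints 1

        ThicknessListenerSketch fixed = new ThicknessListenerSketch(true);
        fixed.tSlider.setValue(8);
        System.out.println("fixed thickness = " + fixed.pixelThickness); // prints 8
    }
}
```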
Figure 26: Task 2 – colorChangeListener should be thicknessChangeListener (on
line 142).
Modification Tasks
The next pair of tasks (Task 3 and Task 4) that the subjects were asked to
complete were modification tasks, which were created to test Hypothesis 2 (in the list in
Section 3.1.1). For these tasks, the subjects were given a bug-free Paint program, where
they were asked to add a specific feature to it. The subjects were told that the
modifications would be made to cloned (copied and pasted) code. Odd numbered subjects
completed Task 3 with CnP clone visualization present and Task 4 without. Even
numbered subjects had CnP for Task 4 and not Task 3. Similar to the debugging tasks,
the CnP clone visualization was set up in advance and the subjects were allowed (but
not required) to use CCFinderX for the task in which CnP was not present.
For Task 3, the subjects were asked to add a titled border with the label “Pixel
Color” to the color panel (colorPanel) and to also add a titled border with the label “Pixel
Thickness” to the thickness panel (thicknessPanel), shown graphically in Figure 27. All
subjects were given written hints about how to create a titled border and how to set a
border on a panel, in case they were unfamiliar with the TitledBorder class.
Figure 27: Titled borders are shown around the color panel and the thickness panel.
Though there are many ways it can be written, Figure 28 gives one solution to this
task. The boxed lines (lines 130 and 153) are the modifications made (added lines).
Figure 28: Task 3 – add a titled border to colorPanel and to thicknessPanel.
That is, to add a titled border to the xxxxPanel (where xxxx is either color or thickness)
the code would be:
    xxxxPanel.setBorder(new TitledBorder("Pixel Xxxx"));
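As a self-contained illustration of the statement above, the following Java sketch applies it to two stand-alone panels. The class TitledBorderSketch and the helper withTitle are hypothetical; the panels here are empty JPanel objects rather than the panels of PaintWindow.java.

```java
import javax.swing.JPanel;
import javax.swing.border.TitledBorder;

// Hypothetical stand-alone version of the Task 3 modification.
class TitledBorderSketch {
    // Sets a titled border on the panel and returns the same panel.
    public static JPanel withTitle(JPanel panel, String title) {
        panel.setBorder(new TitledBorder(title));
        return panel;
    }

    public static void main(String[] args) {
        JPanel colorPanel = withTitle(new JPanel(), "Pixel Color");
        JPanel thicknessPanel = withTitle(new JPanel(), "Pixel Thickness");
        System.out.println(((TitledBorder) colorPanel.getBorder()).getTitle());
        System.out.println(((TitledBorder) thicknessPanel.getBorder()).getTitle());
    }
}
```

Because the two added lines differ only in the panel name and the title, this is exactly the kind of modification in which one clone can be updated and the other forgotten or updated on the wrong panel.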
Task 4 asked the subjects to add color to the label of each color slider such that
the color of the red slider’s label’s text would be red, the color of the green slider’s
label’s text would be green, and the color of the blue slider’s label’s text would be blue,
shown graphically in Figure 29. All subjects were given written hints about how to create
the colors red, green, and blue and how to set the foreground color of a label, in case they
were unfamiliar with the exact syntax.
Figure 29: The labels of the red, green, and blue sliders are shown colored.
Though there are many ways it can be written, Figure 30 gives one solution to this
task. The boxed lines (lines 95-97, 108-110, and 121-123) are the modifications made
(added and changed lines).
Figure 30: Task 4 – add color to the label of each color slider: red, green, and blue.
That is, to add a color to the x label (where x is either r, g, or b) the code would be:
    If red, (r,g,b) = (255,0,0)
    If green, (r,g,b) = (0,255,0)
    If blue, (r,g,b) = (0,0,255)
    Replace:
        xPanel.add(new JLabel("Xxxx"));
    With:
        JLabel xLabel = new JLabel("Xxxx");
        xLabel.setForeground(new Color(r,g,b));
        xPanel.add(xLabel);
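The replacement above can be tried in isolation with the following Java sketch. The class ColoredLabelSketch and the helper addColoredLabel are hypothetical, and the label texts "Red", "Green", and "Blue" are assumptions standing in for the "Xxxx" placeholder.

```java
import java.awt.Color;
import javax.swing.JLabel;
import javax.swing.JPanel;

// Hypothetical stand-alone version of the Task 4 modification.
class ColoredLabelSketch {
    // Creates a label with the given text color, adds it to the panel, and returns it.
    public static JLabel addColoredLabel(JPanel panel, String text, Color color) {
        JLabel label = new JLabel(text);
        label.setForeground(color);
        panel.add(label);
        return label;
    }

    public static void main(String[] args) {
        JPanel rPanel = new JPanel();
        JPanel gPanel = new JPanel();
        JPanel bPanel = new JPanel();
        JLabel rLabel = addColoredLabel(rPanel, "Red", new Color(255, 0, 0));
        JLabel gLabel = addColoredLabel(gPanel, "Green", new Color(0, 255, 0));
        JLabel bLabel = addColoredLabel(bPanel, "Blue", new Color(0, 0, 255));
        System.out.println(rLabel.getForeground());
        System.out.println(gLabel.getForeground());
        System.out.println(bLabel.getForeground());
    }
}
```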
3.1.4.2. Renaming within a Clone
The last four tasks were renaming tasks, which had a time limit of five minutes
each. The renaming process involved changing the name of multiple instances of
identifiers (or parts of identifiers) within copied and pasted code.
Renaming Tasks (with CReN)
Next, the subjects were given the first pair of renaming tasks (Task 5 and Task 6),
which were to test Hypotheses 3 and 4 (in Section 3.1.1) that relate to
the CReN feature. For these tasks, the subjects were given a version of the Paint program
that had all of the correct source code except for requiring renaming within one or more
pasted code fragments (which were copied and pasted beforehand). The subjects were
told that they would need to make modifications (renaming) to copied and pasted code.
CReN was used by odd numbered subjects in Task 5 and by even numbered subjects in
Task 6. For the task in which subjects could not use CReN (Task 6 for odd subjects,
Task 5 for even subjects), they were allowed to use any built-in Eclipse tools, such as
Rename Refactoring or Find & Replace, or they could do the renaming manually.
... programming requires more concentration than other activities. It’s the reason
programmers get upset about ‘quick interruptions’ - such interruptions are tantamount
to asking a juggler to keep three balls in the air and hold your groceries at the same time.
- Steve McConnell, Code Complete
The subject was interrupted during the renaming tasks to increase the odds of
inconsistent renaming. A regular kind of interruption that someone might experience at
work was simulated. For example, the subjects were asked to do some simple
paperwork, such as signing another informed consent form that they could keep for their
records or filling out a form with their personal information so that they could receive
payment for participation in the user study. The interruption time was not included as part
of the time to complete the task, since this was not actual time spent working.
In Task 5, the subjects were told that the Paint program needs a thickness panel
that contains the thickness slider and its panel, which is similar to the color panel that
contains the red, green, and blue sliders and their panels. In this scenario, the color panel
(colorPanel) has already been completed and the thickness panel (thicknessPanel) needs
to be finished. As a last step to complete the thickness panel, the color panel was copied
and pasted. For this task, the subjects had to rename all five instances of colorPanel to
thicknessPanel and the one instance of rPanel to tPanel, within the pasted code fragment
only. Figure 31 shows this renaming being done with CReN.
Figure 31: Task 5 – rename colorPanel to thicknessPanel.
Similarly, Task 6 explained that the Paint program had everything working,
including the tool panel (toolPanel), but just needed the clear/undo panel
(clearUndoPanel) to be completed (which is similar to the existing source code of the tool
panel). The specific task required the subjects to rename all instances of toolPanel to
clearUndoPanel, pencilButton to clearButton, and eraserButton to undoButton in the
pasted code. Figure 32 shows how CReN would rename all six instances of toolPanel in
this clone to clearUndoPanel when the programmer edits one of the instances.
Figure 32: Task 6 – rename toolPanel to clearUndoPanel.
Renaming Tasks (with LexId)
The last pair of renaming tasks that the subjects completed (Task 7 and Task 8)
tested Hypotheses 5 and 6 (Section 3.1.1), which are related to LexId. As in the CReN
tasks, the subjects received the Paint program with the copy-and-paste operations already
made in it and they were only required to perform the actual renaming of the identifiers or
substrings. Odd numbered subjects performed Task 7 with LexId (Task 8 without it) and even
numbered subjects performed Task 8 with LexId (Task 7 without it). Subjects were given
the same options for renaming when without LexId support (Eclipse renaming tools or
manual edits) and the subjects were also interrupted during renaming.
Task 7 involved the scenario where the Paint program had all source code
including the code for the red slider (rSlider) and its panel (rPanel), but still needed the
green and blue sliders (gSlider and bSlider) and their panels (gPanel and bPanel). As the
final step in programming, the code for the red slider and its panel was copied and pasted
twice and the labels and comments modified in advance (so that the subjects only had to
focus on the actual renaming of identifiers and not other details). The subjects were told
to rename rPanel to gPanel and rSlider to gSlider in the green slider clone (shown in
Figure 33), and rPanel to bPanel and rSlider to bSlider in the blue slider clone. LexId
would rename these twenty identifier substrings very quickly with one edit to each clone
(the r to g in the green slider clone, and the r to b in the blue slider clone). This task is
particularly well suited to LexId rather than CReN, as CReN would need two edits for
each clone, one for each of the slider and panel identifiers.
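The difference between the two kinds of renaming can be illustrated with a minimal Java sketch. It is not the implementation of CReN or LexId: the class RenamingSketch is hypothetical, it works on plain text rather than on a parsed program, and it assumes that identifiers are split into parts at camel-case boundaries, which this section does not specify. Text inside string literals and comments would be rewritten as well, which a real tool would have to avoid.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical text-based sketch of whole-identifier vs. substring renaming.
class RenamingSketch {
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Applies the given rewrite to every identifier-like token in the clone.
    private static String rewrite(String clone, UnaryOperator<String> f) {
        Matcher m = IDENTIFIER.matcher(clone);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(clone, last, m.start()).append(f.apply(m.group()));
            last = m.end();
        }
        return out.append(clone.substring(last)).toString();
    }

    // CReN-style: rename every instance of one whole identifier in the clone.
    public static String renameIdentifier(String clone, String oldName, String newName) {
        return rewrite(clone, id -> id.equals(oldName) ? newName : id);
    }

    // LexId-style: rename one camel-case part wherever it occurs in an identifier.
    public static String renameSubstring(String clone, String oldPart, String newPart) {
        return rewrite(clone, id -> {
            String[] parts = id.split("(?<=[a-z0-9])(?=[A-Z])");
            for (int i = 0; i < parts.length; i++) {
                if (parts[i].equals(oldPart)) {
                    parts[i] = newPart;
                }
            }
            return String.join("", parts);
        });
    }

    public static void main(String[] args) {
        String clone = "rPanel.add(rSlider); rSlider.addChangeListener(colorChangeListener);";
        // One substring edit renames both identifiers.
        System.out.println(renameSubstring(clone, "r", "g"));
        // Whole-identifier renaming needs one edit per identifier.
        System.out.println(renameIdentifier(renameIdentifier(clone, "rPanel", "gPanel"),
                "rSlider", "gSlider"));
    }
}
```

Both calls in main produce the same renamed clone, but the substring version gets there with a single edit.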
Figure 33: Task 7 (Part 1) – rename rPanel to gPanel and rSlider to gSlider in the
green slider clone.
For the last task of the user study (Task 8), the Paint program contained all of the
color sliders’ source code, but needed the thickness slider and its panel to be finished.
The source code of the blue slider and its panel were copied and pasted to use as a base
for the thickness slider clone. The subject was then asked to rename bPanel to tPanel and
bSlider to tSlider in the pasted clone only. Figure 34 shows the ten common substring
instances being renamed together with LexId. All labels, number values, comments, and
the listener were updated in advance so that the subject did not have to worry about those
kinds of changes, but only had to do the identifier substring renaming.
Figure 34: Task 8 – rename bPanel to tPanel and bSlider to tSlider in the thickness
slider clone.
Chapter 4
Results
Correctness is clearly the prime quality. If a system does not do what it is supposed to
do, then everything else about it matters little.
- Bertrand Meyer
From the completed CnP user study, data was gathered on the amount of time
that it took the fourteen subjects to complete each of the eight tasks (Section 4.1 –
Time per Task), which had been recorded with the Morae software. Besides speed,
whether the subjects had a correct solution was also examined (Section 4.2 – Solution
Correctness). When reviewing the video recordings, some interesting differences in
how each subject solved the tasks were noticed (Section 4.3 – Method of Completion),
which helped in learning about common programming practices and how they may have
contributed to subjects’ mistakes. The statistically non-significant results that were found
for CnP clone visualization (Tasks 1-4), in terms of task completion speed (Section 4.1),
motivated a more careful examination of how subjects performed the debugging and
modification tasks (Sections 4.2 and 4.3).
4.1. Time per Task
In order to test the hypotheses (listed in Section 3.1.1), the amount of time it took
for subjects to complete the tasks needed to be recorded, specifically comparing the
experiment group (with the tool) versus the control group (without the tool), shown in
Table 6. Each individual subject’s time data was paired together over two tasks (in a
single task category). For example, Subject 1’s data was paired together – Subject 1
would have one task with CnP and the other without CnP – for the debugging tasks
(Tasks 1 & 2), and so on. As mentioned earlier, the two tasks within each task category
were made similar to each other so that task difficulty would not be a variable in the
experiment. Pairing a subject’s “with CnP” data with the same subject’s “without CnP”
data eliminates the variability between different people. This leaves the presence or
absence of the software tool as the only variable that should be observed in the
experiment.
Table 6 shows the average (mean) time in minutes that subjects took to complete
the pair of tasks, with or without CnP, CReN, or LexId. This is a “modified mean”, since
it does not contain the pairs of outliers or erroneous data pairs in its calculation, as these
subjects’ specific data pairs were eliminated from all analysis. An outlier was determined
to be any significantly different time compared to the rest of the subjects’ completion
times for the same task, as determined by a boxplot in MiniTab (Student Release 12).
Erroneous data was determined when there was a tool error (from a bug) or a user error
(from a misunderstanding of the task or the tool, or an observed lack of seriousness while
completing a task).
Table 6: The time (in minutes) to complete each pair of tasks.
In order to statistically analyze the time data, a Chi-Square Goodness of Fit Test
was first performed in MiniTab (version 15) to determine the nature of the data (whether
the data is normally distributed). This statistical test required the data to be in frequencies
[53], so the data was divided as data points less than the mean and data points greater
than or equal to the mean (shown in two columns in Table 6). With such small sample
sizes, it could not be concluded with certainty that the data fit a normal distribution.
According to [254], a user study with a design of one factor (in this case, each
task category), two treatments (with and without the tool), and a paired comparison can
be analyzed with the Wilcoxon non-parametric test. The Wilcoxon Signed Rank Sum
Test was performed as the statistical method of choice and since the number of
observations/pairs was large enough, a normal approximation was also used, displayed in
Table 7. The Wilcoxon test focuses on the median difference, with the null hypothesis
(H0) stating that the median difference is equal to zero and the alternative hypothesis
(Ha) stating that the median difference is less than zero. In other words, the
alternative hypothesis is that subjects take significantly less time to complete the tasks
“with CnP” than “without CnP”. The results showed that in all
of the renaming tasks (both CReN and LexId), subjects completed the tasks quicker (in
less time) with the tool than without the tool (p value of 0.0017 for CReN tasks and p
value of 0.0139 for LexId tasks). That same conclusion could not be made for the
debugging and modification tasks (p value of 0.3632 and p value of 0.4801).
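For readers unfamiliar with the test, the following Java sketch computes the signed-rank sums and the normal approximation for paired times. It is a textbook-style illustration, not the MiniTab procedure used in the study: the class WilcoxonSketch is hypothetical, the numbers in main are made up and are not the study data, zero differences are dropped, tied absolute differences receive average ranks, and no tie or continuity correction is applied to the variance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Wilcoxon signed-rank statistic with normal approximation.
class WilcoxonSketch {
    // Returns {W+, W-, z} for the differences withTool[i] - withoutTool[i].
    public static double[] signedRank(double[] withTool, double[] withoutTool) {
        List<Double> diffs = new ArrayList<>();
        for (int i = 0; i < withTool.length; i++) {
            double d = withTool[i] - withoutTool[i];
            if (d != 0.0) {
                diffs.add(d); // zero differences carry no sign information
            }
        }
        diffs.sort((a, b) -> Double.compare(Math.abs(a), Math.abs(b)));

        int n = diffs.size();
        double wPlus = 0.0;
        double wMinus = 0.0;
        int i = 0;
        while (i < n) {
            int j = i;
            while (j < n && Math.abs(diffs.get(j)) == Math.abs(diffs.get(i))) {
                j++;
            }
            double rank = (i + 1 + j) / 2.0; // average of ranks i+1 .. j
            for (int k = i; k < j; k++) {
                if (diffs.get(k) > 0) {
                    wPlus += rank;
                } else {
                    wMinus += rank;
                }
            }
            i = j;
        }
        double mean = n * (n + 1) / 4.0;
        double sd = Math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0);
        double z = (wPlus - mean) / sd; // NaN when every difference is zero
        return new double[] {wPlus, wMinus, z};
    }

    public static void main(String[] args) {
        // Made-up completion times in minutes, for illustration only.
        double[] withTool = {1.0, 1.5, 2.0, 1.2, 0.8};
        double[] withoutTool = {2.5, 2.0, 2.2, 3.0, 0.7};
        double[] result = signedRank(withTool, withoutTool);
        System.out.println("W+ = " + result[0] + ", W- = " + result[1] + ", z = " + result[2]);
    }
}
```

A clearly negative z supports the alternative hypothesis that the median difference is less than zero; the one-sided p value is the standard normal probability below z.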
Table 7: Statistical hypothesis testing on the paired time data.
4.2. Solution Correctness
Whether subjects had correct solutions to the tasks was determined because correctness
is part of the hypotheses (listed in Section 3.1.1). The purpose was to test whether the
tool (CnP, CReN, and LexId) helps prevent inconsistencies or programming bugs/errors
that can happen without the tool present.
Table 8 shows the number of subjects who had no errors at the time the program was
run and when they announced that they were finished with the task. For
example, for Task 1, it shows that twelve subjects (out of the twelve who ran the
program) had no errors when they ran the program for the first time, that no subjects ran
the program more than once for Task 1, and two people did not run the program at all
(but both had a correct solution when finished). Subjects who did not run the program
may have been very confident in their solution’s correctness.
Table 8: Correct states when running the program or when finished.
There were cases where the programmers made copy-paste-modify errors that
they caught before they were finished and other times when they did not notice the
mistakes at all. All subjects who finished the debugging tasks (Tasks 1 & 2) had correctly
spotted the existing bugs in the source code. The first modification task (Task 3) had the
most people without correct solutions. Four subjects with CnP visualization support and
one subject without CnP support made the same mistake of adding the titled border
around tPanel instead of thicknessPanel (thicknessPanel contains the single panel tPanel,
with the source code and CnP clone visualization shown in Figure 28). For Task 4, many
subjects ran the program multiple times to check the correctness of partial solutions.
Across all twenty-eight renaming cases (Table 8, Tasks 5-8), there were
fourteen incorrect states of the program (either when running the program or when
finished). Thirteen of these were due to renaming mistakes made by subjects when CReN
and LexId were not used (sometimes because the interruption made the subjects
forget what they were working on). These errors could have been prevented with CReN
or LexId. The other incorrect state was due to an engineering oversight. Specifically,
CReN undid a group of consistently renamed identifiers one-by-one after a subject did a
series of “undo” operations in Eclipse during Task 6. In the future, CReN can be
modified to undo all previous renaming together as a single transaction rather than one at
a time. Since CReN is still a proof-of-concept, this oversight is acceptable.
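The proposed change, undoing a whole group of consistent renames as one transaction, can be sketched as follows. This is not CReN's implementation. Eclipse has its own undo infrastructure, so the JDK's `javax.swing.undo.CompoundEdit` is used here only to keep the example self-contained. The identifier occurrences are modeled as a plain list of strings, and `RenameEdit` and `renameAll` are hypothetical names.

```java
import java.util.List;
import javax.swing.undo.AbstractUndoableEdit;
import javax.swing.undo.CompoundEdit;

// Illustrative sketch only; it models identifier occurrences as list entries.
final class RenameUndoSketch {

    // One renamed occurrence; applying the rename happens in the constructor.
    static final class RenameEdit extends AbstractUndoableEdit {
        private final List<String> names;
        private final int index;
        private final String oldName;
        private final String newName;

        RenameEdit(List<String> names, int index, String newName) {
            this.names = names;
            this.index = index;
            this.oldName = names.get(index);
            this.newName = newName;
            names.set(index, newName);
        }

        @Override
        public void undo() {
            super.undo(); // checks and updates the edit's undo state
            names.set(index, oldName);
        }

        @Override
        public void redo() {
            super.redo();
            names.set(index, newName);
        }
    }

    // Renames every occurrence of oldName and groups the individual edits so that
    // a single undo() on the returned object reverts all of them together.
    static CompoundEdit renameAll(List<String> names, String oldName, String newName) {
        CompoundEdit transaction = new CompoundEdit();
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).equals(oldName)) {
                transaction.addEdit(new RenameEdit(names, i, newName));
            }
        }
        transaction.end(); // a CompoundEdit can only be undone once it is closed
        return transaction;
    }
}
```

With this grouping, one undo request restores every renamed occurrence at once, instead of requiring one undo per occurrence as happened in Task 6.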
4.3. Method of Completion
Although no hypotheses were made about how the subjects would complete each
task, interesting differences were found between subjects’ methods of completion that
may have had some effect on their time per task or solution correctness.
For the debugging (Tasks 1 & 2) and modification (Tasks 3 & 4) tasks, it was
found that the subjects used different methods to locate the source code region that the
task applied to and then different ways to inspect the area for bugs or inconsistencies.
Table 9 shows that most subjects manually scanned the source code file, scrolling to find
their region of interest (eleven or twelve out of fourteen, shown in column 2), but a few
(two or three, shown in column 1) used the Find part of the Find & Replace tool or the
Search References tool in Eclipse to jump to a specific location in the file. For the
debugging cases, the subjects often went directly to the areas of the source code that had
to do with either the change listeners’ declaration or its use, indicating that these subjects
probably had prior experience with such code. In the first task, half of the people
compared the broken blue slider code with the red and green slider clones that they knew
worked. For the first modification task (Task 3), the five subjects who had an incorrect final
solution all appeared to “inspect the problem region only”, as they did not notice that
colorPanel and thicknessPanel were clones (not colorPanel and tPanel).
Table 9: Number of subjects who used each location and inspection method for
debugging and modification tasks.
For the renaming tasks (Tasks 5-8), it was interesting to see which methods the
subjects used when not using CReN or LexId for renaming. Table 10 shows that subjects
used Find & Replace, copying and pasting, or manual typing when not using the tool.
Some subjects used the default settings of Find & Replace, which finds the first
occurrence, renames that current instance, then finds the next occurrence, renames it, and
repeats. Subjects who may have wanted to be more efficient chose to configure Find &
Replace to only rename within selected lines (the code fragment of interest). Some
subjects copied and pasted substrings, but the more efficient programmers often chose to
copy and paste whole identifier names (even when a substring was the only change to
make) because it is quicker to select whole identifiers in Eclipse rather than to select part
of an identifier with the mouse. Sometimes subjects started using one method, but
switched to another during the same task (shown in the table with .5 increments). One
subject even tried using Rename Refactoring (not shown in the table) to do the renaming
for Task 5, but he had to choose another method, since refactoring does not work well for
this kind of task (it renamed the declaration and instances within the copied code
fragment, too). Still, some programmers always chose to simply type quickly
(manually) rather than copy and paste or use any renaming tool.
Table 10: Number of times each renaming method was used for renaming tasks.
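The difference between renaming everywhere and renaming only inside the fragment of interest, which is why some subjects restricted Find & Replace to the selected lines and why Rename Refactoring was unsuitable, can be illustrated with a small sketch. It is purely text-based and does not reproduce Eclipse's Find & Replace, CReN, or LexId. The class name, method name, and 1-based line parameters are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; a text-based stand-in for scoped renaming.
final class ScopedRenameSketch {

    // Renames whole-identifier occurrences of oldName, but only on lines
    // firstLine..lastLine (1-based, inclusive). Other lines are left untouched.
    static String renameInLines(String source, int firstLine, int lastLine,
                                String oldName, String newName) {
        String[] lines = source.split("\n", -1);
        // The lookarounds keep longer identifiers such as "redSliderLabel" from matching.
        Pattern wholeIdentifier = Pattern.compile(
                "(?<![A-Za-z0-9_$])" + Pattern.quote(oldName) + "(?![A-Za-z0-9_$])");
        int start = Math.max(0, firstLine - 1);
        for (int i = start; i < lastLine && i < lines.length; i++) {
            lines[i] = wholeIdentifier.matcher(lines[i])
                    .replaceAll(Matcher.quoteReplacement(newName));
        }
        return String.join("\n", lines);
    }
}
```

Passing the line range of the pasted fragment renames the copy while leaving the original code and its declaration unchanged, which a file-wide replace or a rename refactoring would not do.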
Chapter 5
Discussion
To err is human. - Anonymous
Good programmers know what to write.
Great ones know what to rewrite (and reuse).
- Anonymous
Based on the outcomes of the completed CnP user study experiment, several areas
were identified where the experiment design could be refined and where the tool design
could be improved. In particular, factors that may have masked the effect of clone
visualization during debugging and modification tasks were analyzed.
5.1. Confounding Factors for Clone Visualization
Based on the current data from the user study, the original hypotheses about
CnP’s clone visualization feature could not be validated, but the usefulness of clone
visualization was observed. For example, if subjects had made use of cloning information
in Task 3, they would have produced correct solutions (modifying the thicknessPanel
clone rather than the tPanel clone). Even when clone visualization was provided for the
debugging and modification tasks, it was not forced on the user, so the programmer may
not have used it (unlike CReN and LexId, which had to be used when present). Although
it was difficult to tell for sure whether the subjects actually used the visualization, it was
found that some subjects seemed to compare clones (Table 9) and that cloning
information can help avoid incorrect solutions when used.
Subjects’ experience levels can have an effect on their debugging strategies. For
the debugging tasks, some programmers appeared to use a typographic debugging
strategy (where they examined code line by line), while others used a more
symptom-driven approach (where they went directly to the problem region) [56, 183, 201, 202,
248]. The subject’s choice of debugging method seemed to have an effect on whether he
compared clones. If the subject knew the code pattern due to prior experience, he would
know which code belongs to which behavior and he would quickly go to its location in
the file without using the clone visualization. Visualization would be more helpful to
programmers without as much prior knowledge.
It was also observed that the subjects who were more familiar with the Eclipse
IDE used it to their advantage to gain efficiency (using Eclipse’s occurrence
marking and code completion features). Subjects who were more confident in their
solution sometimes chose not to run the program. While this saved time, it
could leave an error undetected, as it did for some subjects.
According to the questionnaire (in Appendix C), no subjects normally use clone
detection tools to debug, and the one subject who used CCFinderX during the user study
reached the time limit for the debugging task. The subjects said that the clone detection
tool had too steep a learning curve, and that they normally debug Java source
code by mentally tracing through it, using print statements, or using the built-in Eclipse
debugger if necessary. The one subject who used the Eclipse debugger during the user
study also took much longer to finish the task than the others.
5.2. Threats to Validity
As stated in Section 3.1.2, all subjects were required to have some familiarity with
Java and Swing in order to participate in the study. However, it appeared that some
subjects who performed very well had more experience or prior knowledge
than others. Such wide variation in programming skill is a general issue in empirical
research and is not unique to this study. While the study had both graduate and senior
undergraduate students as subjects, four of them had non-trivial industrial experience.
Thus, while the sample of subjects may not represent the whole programmer population,
it should at least be a good representation of the entry-to-medium level subset.
Furthermore, some of the subjects with industrial experience also made mistakes when
performing tasks without the tool. Although the claims cannot be generalized to all
programmers, this study demonstrates the CnP tool’s usefulness in clone-related
maintenance tasks.
While the user study’s programming tasks may not represent all possible
scenarios of cloning, they were fairly close to real-world tasks in GUI software
development and maintenance. (Relatively simple debugging and modification tasks were
deliberately chosen in order to measure completion time.) CnP’s features can be applied
throughout source code evolution, making it a long-term benefit.
5.3. Tool Design
The feedback from the subjects about the tool features was overwhelmingly
positive. All subjects said that they would use the tool to help them prevent
copy-and-paste-related errors.
Their main suggestions for changes to the tool are:
1. To make the visualization optional (able to be turned off), while continuing to
track clones in the background, and
2. To know exactly what was renamed by CReN and LexId.
Colored bars were chosen for visualization because they are a simple way to highlight
a clone region in the files that the programmers are already working with. The current
visualization can be improved so that the many (sometimes overlapping) clones’ bars do
not start to clutter the source code files. One suggestion that a subject gave is to only
show the clones that are at the mouse cursor’s current position. Other users’ suggestions
to consider include the ability to disable the clone visualization until it is needed. Some
users also said that they prefer not to have so many different colors in their editor, while
one subject said that he could not easily differentiate between the colored bars, since he is
color blind.
It was evident from the user study that subjects often overestimated what was
renamed. At least two subjects did not know whether CReN renamed the identifier’s
declaration also (which was outside of the clone) like refactoring would. It was also
observed that when CReN automatically renamed a lot of instances of the currently
edited identifier, subjects sometimes prematurely thought that they were done renaming
everything, but there were still other identifiers left to be renamed. The tools could be
improved to display an optional pop-up window that tells exactly what was renamed.
Chapter 6
Conclusion
The software isn't finished until the last user is
dead. - Anonymous
A widely-known problem in software engineering involves the maintenance of
code clones (similar code fragments), which are often a result of copying and pasting.
Not only can the maintenance process itself be tedious, but inconsistent updates across
related clones can remain undetected, decreasing software quality. Consequently, many
people strongly dislike copy-and-paste and code clones, and seek to prevent the creation
of clones or at least aim to remove them from the source code as soon as possible by
replacing the multiple, related clones with a single abstraction. Recent studies, however,
have shown that clones may be beneficial and desirable, and rather than focusing on
clone detection and removal, there has instead been greater tool support for clone editing.
This dissertation contributed a suite of software tools, called CnP, that aid programmers
in their copy, paste, and modify coding activity throughout the clone lifecycle.
Specifically, CnP was the first known tool in the integrated development environment to
capture code clones proactively as they were made via copying and pasting, and it was
the first to support consistent modifications within a clone rather than only between
clones. CnP also utilizes the abstract syntax tree framework within the Eclipse IDE,
which provides more accurate clone tracking and clone comparison than other commonly
used methods of source code representation. CnP’s clone visualization, consistent
identifier renaming (CReN), and consistent substring renaming (LexId) features were
evaluated in a user study, which was analyzed in terms of task completion time, solution
correctness, and method of completion.
6.1. Research Contributions
The main contributions of this research included:
• The copy-and-paste (CnP) tool
o Proactive tracking – CnP/CReN were the first known clone tracking
tools published (in 2007), which took a more proactive approach to
capturing clones upon creation (by detecting when a copy and paste occurs
and gathering the initial clone and identifier information at that time when
the clones are identical).
o Intra-clone editing – CReN was the only known tool to support editing
within a clone (all previous tools only supported between-clone editing).
Intra-clone editing is done when programmers copy, paste, and modify the
pasted code to fit the current task. The kind of modification that is made in
these cases is often identifier renaming, which is what CReN supports.
o AST-based – CnP makes use of the abstract syntax tree (AST)
representation of the source code, which is a better approach than the text-
based methods that cannot differentiate between source code and any other
text. CSeR is one of the few differencing tools to take advantage of ASTs.
• Dimensions of clone tracking tool development – When comparing CnP with
related clone tracking tools, a variety of clone properties were determined that
these kinds of tools must explicitly define. Listing the properties can be useful in
the creation of new tools or to help redefine a tool’s current property definitions.
• Definition of the clone lifecycle – The comparison of tools also led to a definition
of the clone lifecycle stages, including some areas where there is current tool
support and areas that need more support.
• Realization about clone visualization – In the user study on CnP,
CnP’s clone visualization was not found to yield statistically quicker or more
correct solutions than working without it. Observation and other analysis (in Section 5.1)
helped better determine whether and when a programmer may exploit clone
information. There is no other known similar analysis of the role of clone
information in maintenance tasks, and thus the analysis in and of itself can be a
contribution. The analysis can be used in the design of future experiments.
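The AST-based contribution listed above rests on the point that text-based methods cannot tell source code from any other text. The sketch below illustrates that point under loud assumptions. CnP relies on the Eclipse AST framework, which is not reproduced here. Instead, a small hand-written scanner that recognizes comments, string and character literals, numbers, and identifiers stands in for syntax-aware processing. All class and method names are hypothetical.

```java
// Illustrative sketch only; a token scanner stands in for a real AST.
final class TokenAwareRenameSketch {

    // Text-based renaming: every occurrence is replaced, including those
    // inside comments and string literals.
    static String renameText(String source, String oldName, String newName) {
        return source.replace(oldName, newName);
    }

    // Syntax-aware renaming: comments and literals are copied unchanged, and
    // only whole identifiers equal to oldName are replaced.
    static String renameIdentifiers(String source, String oldName, String newName) {
        StringBuilder out = new StringBuilder();
        int n = source.length();
        int i = 0;
        while (i < n) {
            char c = source.charAt(i);
            int end;
            if (source.startsWith("//", i)) {               // line comment
                end = source.indexOf('\n', i);
                if (end < 0) end = n;
                out.append(source, i, end);
            } else if (source.startsWith("/*", i)) {        // block comment
                end = source.indexOf("*/", i + 2);
                end = (end < 0) ? n : end + 2;
                out.append(source, i, end);
            } else if (c == '"' || c == '\'') {             // string or char literal
                end = i + 1;
                while (end < n && source.charAt(end) != c) {
                    if (source.charAt(end) == '\\') end++;  // skip the escaped character
                    end++;
                }
                end = Math.min(end + 1, n);
                out.append(source, i, end);
            } else if (Character.isDigit(c)) {              // numeric literal such as 0x1F
                end = i + 1;
                while (end < n && Character.isJavaIdentifierPart(source.charAt(end))) end++;
                out.append(source, i, end);
            } else if (Character.isJavaIdentifierStart(c)) { // identifier or keyword
                end = i + 1;
                while (end < n && Character.isJavaIdentifierPart(source.charAt(end))) end++;
                String word = source.substring(i, end);
                out.append(word.equals(oldName) ? newName : word);
            } else {                                        // operators, whitespace, etc.
                end = i + 1;
                out.append(c);
            }
            i = end;
        }
        return out.toString();
    }
}
```

On a fragment such as `int red = 0; // red channel`, the text-based method also rewrites the comment, while the scanner-based method changes only the identifier. A real AST goes further than this scanner, since it also knows which declaration each identifier refers to.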
6.2. Future Work
As software tools get used by more people, there are always suggestions for new
features. As mentioned earlier, more features would need to be added to CnP before it
fully supports the entire clone lifecycle, including inter-clone editing (simultaneous edits
between clones) and clone refactoring support. CnP could also leverage version control
system information, which seems to be a “hot topic” in the field lately. A challenge is to
have all of the parts of CnP work together in the Eclipse editor at the same time (with the
ability to turn any feature on and off). There are also many ideas for additional
functionality in LexId, which generally is a tool for inferring lexical patterns in source
code. Furthermore, software demonstrations at conferences and the completion of a user
study have all provided a lot of feedback and recommendations for improvement. Though
CnP has already been improved upon (for example, its clone visualization), even the
current version of CnP is still just a prototype implementation with the concepts as the
true contribution. As the quote at the top of the chapter heading implies, software
maintenance is on-going and the software will most likely never be “complete” as the
users will always require additional fixes and updates to the current version.
6.2.1. Theory about Copy-and-Paste and Abstractions
… the purpose of abstraction is not to be vague,
but to create a new semantic level in which one
can be absolutely precise. - Edsger Dijkstra
After doing this research, which has primarily focused on the modification or
editing stage of clones, other areas to examine include theories on copy-and-paste (the
creation of clones) and abstractions (the extinction of clones).
First, to find out more about copy-and-paste than what was already done [132], a
survey (questionnaire) can be conducted to get some preliminary insights into copy-and-paste
and abstractions: who, what, where, when, why, and how. For example, often
programmers just focus on getting a program to work rather than paying close attention to
the source code design and documentation. Instead of thinking about abstractions, the
programmer may copy and paste to get a quicker solution. Also, later requirements of the
software may not be known at the time it is first being written or they might change over
time, making the development of an abstraction difficult. Knowing which kind of
abstraction would be most useful for future reuse is important to avoid having to redo the
abstraction later. In addition, some people (especially novice programmers) may find
creating abstractions difficult in general [150]. A question to answer is: do people
(programmers) naturally think in terms of abstractions or rather in quick
copy-and-pastes? And after “temporary” copying and pasting, do programmers always refactor to
abstractions later on – why or why not? This study would serve as key background
information to support this research further.
Second, not only does more research need to be done on refactoring as an editing
process, but more research needs to be done on the abstractions themselves. Abstractions
can be very powerful [29, 222], since an abstraction solves sub-problems [175] and a
single abstraction does more than one computation. But, some abstractions can be
“fragile” and others can be too awkward. In this portion of the study, a question to answer
is: what are the properties of a nice abstraction? And if a nice abstraction cannot be made,
is a copied-and-pasted solution okay?
6.2.2. Other Applications of This Research
The concepts of this research can be applied to other IDEs, programming editors,
and programming languages. They can even apply outside of the programming domain.
Not only do people copy and paste when writing software, but they often copy and paste
when writing any other document. An extension of this research in preventing copy-and-
paste-related errors in the IDE could be to prevent copy-and-paste-related errors in the
word processor, for example. If a paragraph that had a typo in it was copied and pasted
multiple times and the typo was found later on by a copy editor, wouldn’t it be difficult to
fix that typo in all spots (especially when this person was not the original author)? Also,
is better renaming support needed in the word processor? What about the use of
templates? Is there anything to be learned from looking at the evolution of word
documents across versions? All of these areas would be interesting to look into to see if
the process of copy, paste, and modify can be improved in every place where it is done.
References
[1] E. Adar and M. Kim, “SoftGUESS: Visualization and Exploration of Code Clones in Context”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2007.
[2] R. Al-Ekram, C. Kapser, R. Holt, and M. Godfrey, “Cloning by Accident: An Empirical Study of Source Code Cloning Across Software Systems”, ACM SIGSOFT-IEEE International Symposium on Empirical Software Engineering (ISESE), 2005.
[3] F. Aliverti-Piuri, “Java Copier Frees You from Tedious Coding”, DevX Jupitermedia Corporation, 2007. http://www.devx.com/Java/Article/17947/1954?pf=true
[4] N. Anquetil and T. Lethbridge, “Assessing the Relevance of Identifier Names in a Legacy Software System”, IBM Conference of the Centre for Advanced Studies on Collaborative Research (CASCON), 1998.
[5] T. Apiwattanapong, A. Orso, and M.J. Harrold, “A Differencing Algorithm for Object-Oriented Programs”, ACM SIGSOFT-SIGART-IEEE International Conference on Automated Software Engineering (ASE), 2004.
[6] D. Atkins, T. Ball, T. Graves, and A. Mockus, “Using Version Control Data to Evaluate the Impact of Software Tools”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 1999.
[7] D.L. Atkins, “Version Sensitive Editing: Change History as a Programming Tool”, ACM SIGPLAN-SIGSOFT European Conference on Object-Oriented Programming (ECOOP), 1998.
[8] L. Aversano, L. Cerulo, and M. Di Penta, “How Clones are Maintained: An Empirical Study”, IEEE European Conference on Software Maintenance and Reengineering (CSMR), 2007.
[9] B.S. Baker, “A Theory of Parameterized Pattern Matching: Algorithms and Applications”, ACM SIGACT Symposium on Theory of Computing (STOC), 1993.
[10] B.S. Baker, “Finding Clones with Dup: Analysis of an Experiment”, IEEE Transactions on Software Engineering (TSE), 2007.
[11] B.S. Baker, “On Finding Duplication and Near-Duplication in Large Software Systems”, IEEE Working Conference on Reverse Engineering (WCRE), 1995.
[12] B.S. Baker, “Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance”, SIAM Journal on Computing, 1997.
[13] T. Bakota, R. Ferenc, and T. Gyimothy, “Clone Smells in Software Evolution”, IEEE International Conference on Software Maintenance (ICSM), 2007.
[14] M. Balint, T. Girba, and R. Marinescu, “How Developers Copy”, IEEE International Conference on Program Comprehension (ICPC), 2006.
[15] T. Ball, S. Diehl, D. Notkin, and A. Zeller, “Multi-Version Program Analysis”, Dagstuhl Seminar, 2005.
[16] D.R. Barstow, “Overview of a Display-Oriented Editor for INTERLISP”, International Joint Conference on Artificial Intelligence (IJCAI), 1981.
[17] H.A. Basit and S. Jarzabek, “A Case for Structural Clones”, International Workshop on Software Clones (IWSC), 2009.
[18] H.A. Basit and S. Jarzabek, “Detecting Higher-level Similarity Patterns in Programs”, European Software Engineering Conference (ESEC) and ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), 2005.
[19] H.A. Basit, D.C. Rajapakse, and S. Jarzabek, “Beyond Templates: A Study of Clones in the STL and Some General Implications”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2005.
[20] S. Bates and S. Horwitz, “Incremental Program Testing Using Program Dependence Graphs”, ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 1993.
[21] I.D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone Detection Using Abstract Syntax Trees”, IEEE International Conference on Software Maintenance (ICSM), 1998.
[22] S. Bellon, “Detection of Software Clones: Tool Comparison Experiment”, University of Stuttgart, 2004-Present. http://www.bauhaus-stuttgart.de/clones/
[23] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, “Comparison and Evaluation of Clone Detection Tools”, IEEE Transactions on Software Engineering (TSE), 2007.
[24] D. Binkley, “Semantics Guided Regression Test Cost Reduction”, IEEE Transactions on Software Engineering (TSE), 1997.
[25] D. Binkley, R. Capellini, L.R. Raszewski, and C. Smith, “An Implementation of and Experiment with Semantic Differencing”, IEEE International Conference on Software Maintenance (ICSM), 2001.
[26] D. Binkley, H. Feild, D. Lawrie, and M. Pighin, “Software Fault Prediction using Language Processing”, IEEE Testing: Academic & Industrial Conference - Practice And Research Techniques (TAIC PART), 2007.
[27] M. Boshernitsan, “Harmonia: A Flexible Framework for Constructing Interactive Language-Based Programming Tools”, University of California, Berkeley, Technical Report CSD-01-1149, 2001.
[28] J. Brandt, P.J. Guo, J. Lewenstein, S.R. Klemmer, and M. Dontcheva, “Opportunistic Programming: Writing Code to Prototype, Ideate, and Discover”, IEEE Software, 2009.
[29] J. Brockman, “Intentional Programming: A Talk With Charles Simonyi”, Edge Foundation, 2004. http://www.edge.org/digerati/simonyi/simonyi_p1.html
[30] M. Broy, F. Deissenboeck, and M. Pizka, “Demystifying Maintainability”, International Workshop on Software Quality (WoSQ), 2006.
[31] G. Canfora, L. Cerulo, and M. Di Penta, “Identifying Changed Source Code Lines from Version Repositories”, ACM SIGSOFT-IEEE International Workshop on Mining Software Repositories (MSR), 2007.
[32] G. Canfora, L. Cerulo, and M. Di Penta, “Ldiff: An Enhanced Line Differencing Tool”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2009.
[33] G. Canfora, L. Cerulo, and M. Di Penta, “Tracking Your Changes: A Language-Independent Approach”, IEEE Software, 2009.
[34] B. Caprile and P. Tonella, “Restructuring Program Identifier Names”, IEEE International Conference on Software Maintenance (ICSM), 2000.
[35] R. Caron, “Coding Techniques and Programming Practices”, Microsoft Corporation, 2000. http://msdn2.microsoft.com/en-us/library/aa260844%28VS.60%29.aspx
[36] B. Carter, “On Choosing Identifiers”, ACM SIGPLAN Notices, 1982.
[37] S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom, “Change Detection in Hierarchically Structured Information”, ACM SIGMOD International Conference on Management of Data (SIGMOD), 1996.
[38] A. Chiu and D. Hirtle, “Beyond Clone Detection”, University of Waterloo, Course Project, 2007.
[39] J.R. Cordy, “Comprehending Reality - Practical Barriers to Industrial Adoption of Software Maintenance Automation”, IEEE International Workshop on Program Comprehension (IWPC), 2003.
[40] J.R. Cordy, T.R. Dean, and N. Synytskyy, “Practical Language-Independent Detection of Near-Miss Clones”, IBM Conference of the Centre for Advanced Studies on Collaborative Research (CASCON), 2004.
[41] R. Cottrell, J.J.C. Chang, R.J. Walker, and J. Denzinger, “Determining Detailed Structural Correspondence for Generalization Tasks”, European Software Engineering Conference (ESEC) and ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), 2007.
[42] R. Cottrell, B. Goyette, R. Holmes, R.J. Walker, and J. Denzinger, “Compare and Contrast: Visual Exploration of Source Code Examples”, IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT), 2009.
[43] R. Cottrell, R.J. Walker, and J. Denzinger, “Jigsaw: A Tool for the Small-Scale Reuse of Source Code”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2008.
[44] R. Cottrell, R.J. Walker, and J. Denzinger, “Semi-automating Small-Scale Source Code Reuse via Structural Correspondence”, ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), 2008.
[45] F.J. Damerau, “A Technique for Computer Detection and Correction of Spelling Errors”, Communications of the ACM, 1964.
[46] M. de Wit, “Managing Clones Using Dynamic Change Tracking and Resolution, Helping Developers to Cope with Changing Clone Fragments”, Delft University of Technology, Master’s Thesis, 2008.
[47] M. de Wit, A. Zaidman, and A. van Deursen, “Managing Code Clones Using Dynamic Change Tracking and Resolution”, IEEE International Conference on Software Maintenance (ICSM), 2009.
[48] F. Deissenboeck and M. Pizka, “Concise and Consistent Naming”, IEEE International Workshop on Program Comprehension (IWPC), 2005.
[49] F. Deissenboeck and M. Pizka, “Concise and Consistent Naming”, Software Quality Journal, Springer, 2006.
[50] F. Deissenboeck, M. Pizka, and T. Seifert, “Tool Support for Continuous Quality Assessment”, IEEE International Workshop on Software Technology and Engineering Practice (STEP), 2005.
[51] F. Detienne, “Reasoning from a Schema and from an Analog in Software Code Reuse”, Workshop on Empirical Studies of Programmers (ESP), 1991.
[52] F. Detienne, Software Design - Cognitive Aspects, Springer, 2002.
[53] J. Devore and N. Farnum, Applied Statistics for Engineers and Scientists, Brooks/Cole Publishing Company, 1999.
[54] E. Duala-Ekoko and M.P. Robillard, “CloneTracker: Tool Support for Code Clone Management”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2008.
[55] E. Duala-Ekoko and M.P. Robillard, “Tracking Code Clones in Evolving Software”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2007.
[56] M. Ducasse and A. Emde, “A Review of Automated Debugging Systems: Knowledge, Strategies and Techniques”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 1988.
[57] S. Ducasse, M. Rieger, and S. Demeyer, “A Language Independent Approach for Detecting Duplicated Code”, IEEE International Conference on Software Maintenance (ICSM), 1999.
[58] D. Engler, D.Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code”, ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2001.
[59] L.W. Eriksen, “Code Reuse in Object Oriented Software Development”, Norwegian University of Science and Technology, Course Project, 2004.
[60] W.S. Evans, C.W. Fraser, and F. Ma, “Clone Detection via Structural Abstraction”, IEEE Working Conference on Reverse Engineering (WCRE), 2007.
[61] R. Falke, P. Frenzel, and R. Koschke, “Empirical Evaluation of Clone Detection using Syntax Suffix Trees”, Empirical Software Engineering, Springer, 2008.
[62] H. Feild, D. Binkley, and D. Lawrie, “An Empirical Comparison of Techniques for Extracting Concept Abbreviations from Identifiers”, IASTED International Conference on Software Engineering and Applications (SEA), 2006.
[63] J. Ferrante, K.J. Ottenstein, and J.D. Warren, “The Program Dependence Graph and Its Use in Optimization”, ACM Transactions on Programming Languages and Systems (TOPLAS), 1987.
[64] F. Fioravanti, G. Migliarese, and P. Nesi, “Reengineering Analysis of Object-Oriented Systems via Duplication Analysis”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2001.
[65] N.V. Flor and E.L. Hutchins, “Analyzing Distributed Cognition in Software Teams: A Case Study of Team Programming During Perfective Software Maintenance”, Workshop on Empirical Studies of Programmers (ESP), 1991.
[66] B. Fluri, M. Wuersch, M. Pinzger, and H.C. Gall, “Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction”, IEEE Transactions on Software Engineering (TSE), 2007.
[67] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts, Refactoring: Improving the Design of Existing Code, Addison-Wesley, 1999.
[68] P. Frenzel, R. Koschke, A.P.J. Breu, and K. Angstmann, “Extending the Reflection Method for Consolidating Software Variants into Product Lines”, IEEE Working Conference on Reverse Engineering (WCRE), 2007.
[69] M. Gabel, L. Jiang, and Z. Su, “Scalable Detection of Semantic Clones”, ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2008.
[70] H. Gall, M. Jazayeri, and J. Krajewski, “CVS Release History Data for Detecting Logical Couplings”, ACM SIGSOFT-IEEE International Workshop on Principles of Software Evolution (IWPSE), 2003.
[71] K. Gallagher and L. Layman, “Are Decomposition Slices Clones?”, IEEE International Conference on Program Comprehension (ICPC), 2003.
[72] R. Geiger, B. Fluri, H.C. Gall, and M. Pinzger, “Relation of Code Clones and Change Couplings”, European Joint Conferences on Theory and Practice of Software (ETAPS) International Conference on Fundamental Approaches to Software Engineering (FASE), 2006.
[73] D. Gentner and A.B. Markman, “Structure Mapping in Analogy and Similarity”, American Psychologist, 1997.
[74] M.W. Godfrey, “All We Like Sheep: Cloning as an Engineering Tool”, Working Session on Myths in Software Engineering at ICSM (MythSE), 2007. http://www.slideshare.net/migod/all-we-like-sheep-cloning-as-an-engineering-tool
[75] M.W. Godfrey and L. Zou, “Using Origin Analysis to Detect Merging and Splitting of Source Code Entities”, IEEE Transactions on Software Engineering (TSE), 2005.
[76] T. Green and A. Blackwell, “Cognitive Dimensions of Information Artefacts: A Tutorial”, British Computer Society Conference on Human-Computer Interaction (BCS HCI), 1998.
[77] W.G. Griswold, “Coping with Crosscutting Software Changes Using Information Transparency”, ACM SIGSOFT International Conference on Metalevel Architectures and Separation of Crosscutting Concerns (REFLECTION), 2001.
[78] W.G. Griswold, “Program Restructuring as an Aid to Software Maintenance”, University of Washington, PhD Dissertation, 1991.
[79] W.G. Griswold and D. Notkin, “Computer-Aided vs. Manual Program Restructuring”, ACM SIGSOFT Software Engineering Notes, 1992.
[80] A.N. Habermann and D. Notkin, “Gandalf: Software Development Environments”, IEEE Transactions on Software Engineering (TSE), 1986.
[81] A.E. Hassan and T. Zimmermann, “Myth: Clones Are Evil.”, Working Session on Myths in Software Engineering at ICSM (MythSE), 2007. http://mythse.wikispaces.com/Clones+are+evil.
[82] S. Hayashi, M. Saeki, and M. Kurihara, “Supporting Refactoring Activities Using Histories of Program Modification”, The Institute of Electronics, Information and Communication Engineers (IEICE), 2006.
[83] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue, “Method and Implementation for Investigating Code Clones in a Software System”, Information and Software Technology, Elsevier, 2006.
[84] R. Hill and J. Rideout, “Automatic Method Completion”, ACM SIGSOFT-SIGART-IEEE International Conference on Automated Software Engineering (ASE), 2004.
[85] R. Holmes, “Pragmatic Software Reuse”, University of Calgary, PhD Dissertation, 2008.
[86] R. Holmes, R. Cottrell, R.J. Walker, and J. Denzinger, “The End-to-End Use of Source Code Examples: An Exploratory Study”, IEEE International Conference on Software Maintenance (ICSM), 2009.
[87] R. Holmes, R. Cottrell, R.J. Walker, and J. Denzinger, “The End-to-End Use of Source Code Examples: An Exploratory Study – Appendix”, University of Calgary, Technical Report 2009-934-13, 2009.
[87] R. Holmes and G.C. Murphy, “Using Structural Context to Recommend Source Code Examples”, ACM SIGSOFT-IEEE International Conference on Software
Engineering (ICSE), 2005. [88] R. Holmes and R.J. Walker, “Lightweight, Semi-Automated Enactment of
Pragmatic-Reuse Plans”, International Conference on Software Reuse (ICSR), 2008.
[89] R. Holmes and R.J. Walker, “Semi-Automating Pragmatic Reuse Tasks”, ACM
SIGSOFT-SIGART-IEEE International Conference on Automated Software
Engineering (ASE), 2008. [90] R. Holmes and R.J. Walker, “Supporting the Investigation and Planning of
Pragmatic Reuse Tasks”, ACM SIGSOFT-IEEE International Conference on
Software Engineering (ICSE), 2007. [91] R. Holmes and R.J. Walker, “Task-specific Source Code Dependency
Investigation”, IEEE International Workshop on Visualizing Software for
Understanding and Analysis (VISSOFT), 2007. [92] R. Holmes, R.J. Walker, and G.C. Murphy, “Strathcona Example Recommendation
Tool”, ACM SIGSOFT International Symposium on the Foundations of Software
Engineering (FSE), 2005. [93] S. Horwitz, “Identifying the Semantic and Textual Differences Between Two
Versions of a Program”, ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI), 1990. [94] S. Horwitz and T. Reps, “The Use of Program Dependence Graphs in Software
Engineering”, ACM SIGSOFT-IEEE International Conference on Software
Engineering (ICSE), 1992. [95] D. Hou, P. Jablonski, and F. Jacob, “CnP: Towards an Environment for the
Proactive Management of Copy-and-Paste Programming”, IEEE International
Conference on Program Comprehension (ICPC), 2009. [96] D. Hou, F. Jacob, and P. Jablonski, “Exploring the Design Space of Proactive Tool
Support for Copy-and-Paste Programming”, IBM Conference of the Centre for
Advanced Studies on Collaborative Research (CASCON), 2009. [97] D. Hou, F. Jacob, and P. Jablonski, “Proactively Managing Copy-and-Paste
Induced Code Clones”, IEEE International Conference on Software Maintenance
(ICSM), 2009. [98] E. Hughes, “Checking Spelling in Source Code”, ACM SIGPLAN Notices, 2004.
[99] J.W. Hunt and M.D. McIlroy, “An Algorithm for Differential File Comparison”, Bell Laboratories, Bell Laboratories Computing Science Technical Report #41, 1976.
[100] P. Jablonski, “Clone-Aware Editing with CnP”, ACM SIGSOFT International
Symposium on the Foundations of Software Engineering (FSE), Student Research Forum, 2008.
[101] P. Jablonski, “Managing the Copy-and-Paste Programming Practice in Modern
IDEs”, ACM SIGPLAN Conference on Object-Oriented Programming, Systems,
Languages, and Applications (OOPSLA), 2007. [102] P. Jablonski and D. Hou, “Aiding Software Maintenance with Copy-and-Paste
Clone-Awareness”, IEEE International Conference on Program Comprehension
(ICPC), 2010. [103] P. Jablonski and D. Hou, “CReN: A Tool for Tracking Copy-and-Paste Code
Clones and Renaming Identifiers Consistently in the IDE”, Eclipse Technology
Exchange Workshop at OOPSLA (ETX), 2007. [104] P. Jablonski and D. Hou, “Renaming Parts of Identifiers Consistently within Code
Clones”, IEEE International Conference on Program Comprehension (ICPC), 2010.
[105] D. Jackson and D.A. Ladd, “Semantic Diff: A Tool for Summarizing the Effects of
Modifications”, IEEE International Conference on Software Maintenance (ICSM), 1994.
[106] F. Jacob, “CSeR - A Code Editor For Tracking & Visualizing Detailed Clone
Differences”, Clarkson University, Master’s Thesis, 2009. [107] F. Jacob, D. Hou, and P. Jablonski, “Actively Comparing Clones Inside The Code
Editor”, International Workshop on Software Clones (IWSC), 2010. [108] S. Jarzabek and S. Li, “Unifying Clones with a Generative Programming
Technique: A Case Study”, Journal of Software Maintenance and Evolution:
Research and Practice (JSME), John Wiley & Sons, 2006. [109] L. Jiang, “Scalable Detection of Similar Code: Techniques and Applications”,
University of California, Davis, PhD Dissertation, 2009. [110] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “DECKARD: Scalable and Accurate
Tree-based Detection of Code Clones”, ACM SIGSOFT-IEEE International
Conference on Software Engineering (ICSE), 2007.
[111] L. Jiang, Z. Su, and E. Chiu, “Context-Based Detection of Clone-Related Bugs”, European Software Engineering Conference (ESEC) and ACM SIGSOFT
International Symposium on the Foundations of Software Engineering (FSE), 2007. [112] J.H. Johnson, “Identifying Redundancy in Source Code using Fingerprints”, IBM
Conference of the Centre for Advanced Studies on Collaborative Research
(CASCON), 1993. [113] J.H. Johnson, “Substring Matching for Clone Detection and Change Tracking”,
IEEE International Conference on Software Maintenance (ICSM), 1994. [114] J.H. Johnson, “Visualizing Textual Redundancy in Legacy Source”, IBM
Conference of the Centre for Advanced Studies on Collaborative Research
(CASCON), 1994. [115] E. Juergens, F. Deissenboeck, and B. Hummel, “Clone Detection Beyond Copy &
Paste”, International Workshop on Software Clones (IWSC), 2009. [116] E. Juergens, F. Deissenboeck, and B. Hummel, “CloneDetective - A Workbench
for Clone Detection Research”, ACM SIGSOFT-IEEE International Conference on
Software Engineering (ICSE), 2009. [117] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner, “Do Code Clones
Matter?”, ACM SIGSOFT-IEEE International Conference on Software Engineering
(ICSE), 2009. [118] E. Juergens, B. Hummel, F. Deissenboeck, and M. Feilkas, “Static Bug Detection
Through Analysis of Inconsistent Clones”, Testmethoden fur Software (TESO), 2008.
[119] T. Kamiya, “Variation Analysis of Context-Sharing Identifiers with Code Clones”,
IEEE International Conference on Software Maintenance (ICSM), 2008. [120] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A Multilinguistic Token-based
Code Clone Detection System for Large Scale Source Code”, IEEE Transactions
on Software Engineering (TSE), 2002. [121] C. Kapser and M.W. Godfrey, “Aiding Comprehension of Cloning Through
Categorization”, ACM SIGSOFT-IEEE International Workshop on Principles of
Software Evolution (IWPSE), 2004. [122] C. Kapser and M.W. Godfrey, “‘Cloning Considered Harmful’ Considered
Harmful”, IEEE Working Conference on Reverse Engineering (WCRE), 2006.
[123] C. Kapser and M.W. Godfrey, “Improved Tool Support for the Investigation of Duplication in Software”, IEEE International Conference on Software
Maintenance (ICSM), 2005. [124] C. Kapser and M.W. Godfrey, “Toward a Taxonomy of Clones in Source Code: A
Case Study”, International Workshop on Evolution of Large-scale Industrial
Software Applications (ELISA), 2003. [125] C.J. Kapser and M.W. Godfrey, “‘Cloning Considered Harmful’ Considered
Harmful: Patterns of Cloning in Software”, Empirical Software Engineering, Springer, 2008.
[126] C.J. Kapser and M.W. Godfrey, “Supporting the Analysis of Clones in Software
Systems: A Case Study”, Journal of Software Maintenance and Evolution:
Research and Practice (JSME), John Wiley & Sons, 2005. [127] C. Kelleher, D. Cosgrove, D. Culyba, C. Forlines, J. Pratt, and R. Pausch, “Alice2:
Programming without Syntax Errors”, ACM SIGCHI Symposium on User Interface
Software and Technology (UIST), 2002. [128] R. Kerr and W. Stuerzlinger, “Context-Sensitive Cut, Copy, and Paste”, ACM
Canadian Conference on Computer Science & Software Engineering (C3S2E), 2008.
[129] M. Kim, “Analyzing and Inferring the Structure of Code Changes”, University of
Washington, PhD Dissertation, 2008. [130] M. Kim, “Ethnographic Study of Copy and Paste Programming Practices in
OOPL”, University of Washington, Qualification Exam Report, 2003. [131] M. Kim, “Understanding and Aiding Code Evolution by Inferring Change
Patterns”, ACM SIGSOFT-IEEE International Conference on Software
Engineering (ICSE), 2007. [132] M. Kim, L. Bergman, T. Lau, and D. Notkin, “An Ethnographic Study of Copy and
Paste Programming Practices in OOPL”, ACM SIGSOFT-IEEE International
Symposium on Empirical Software Engineering (ISESE), 2004. [133] M. Kim and D. Notkin, “Discovering and Representing Systematic Code Changes”,
ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2009.
[134] M. Kim and D. Notkin, “Program Element Matching for Multi-Version Program
Analyses”, ACM SIGSOFT-IEEE International Workshop on Mining Software
Repositories (MSR), 2006.
[135] M. Kim and D. Notkin, “Using a Clone Genealogy Extractor for Understanding and Supporting Evolution of Code Clones”, ACM SIGSOFT-IEEE International
Workshop on Mining Software Repositories (MSR), 2005. [136] M. Kim, D. Notkin, and D. Grossman, “Automatic Inference of Structural Changes
for Matching Across Program Versions”, ACM SIGSOFT-IEEE International
Conference on Software Engineering (ICSE), 2007. [137] M. Kim, V. Sazawal, D. Notkin, and G.C. Murphy, “An Empirical Study of Code
Clone Genealogies”, ACM SIGSOFT International Symposium on the Foundations
of Software Engineering (FSE), 2005. [138] S. Kim, K. Pan, and E.J. Whitehead, Jr., “When Functions Change Their Names:
Automatic Detection of Origin Relationships”, IEEE Working Conference on
Reverse Engineering (WCRE), 2005. [139] A.J. Ko, H.H. Aung, and B.A. Myers, “Design Requirements for More Flexible
Structured Editors from a Study of Programmers’ Text Editing”, ACM SIGCHI
Conference on Human Factors in Computing Systems (CHI), 2005. [140] A.J. Ko, H.H. Aung, and B.A. Myers, “Eliciting Design Requirements for
Maintenance-Oriented IDEs: A Detailed Study of Corrective and Perfective Maintenance Tasks”, ACM SIGSOFT-IEEE International Conference on Software
Engineering (ICSE), 2005. [141] A.J. Ko and B.A. Myers, “Barista: An Implementation Framework for Enabling
New Tools, Interaction Techniques and Views in Code Editors”, ACM SIGCHI
Conference on Human Factors in Computing Systems (CHI), 2006. [142] R. Komondoor and S. Horwitz, “Effective, Automatic Procedure Extraction”, IEEE
International Workshop on Program Comprehension (IWPC), 2003. [143] R. Komondoor and S. Horwitz, “Semantics-Preserving Procedure Extraction”,
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
(POPL), 2000. [144] R. Komondoor and S. Horwitz, “Tool Demonstration: Finding Duplicated Code
Using Program Dependencies”, ACM SIGPLAN European Symposium on
Programming (ESOP), 2001. [145] R. Komondoor and S. Horwitz, “Using Slicing to Identify Duplication in Source
Code”, International Symposium on Static Analysis (SAS), 2001. [146] K. Kontogiannis, “Managing Known Clones: Issues and Open Questions”,
Dagstuhl Seminar, 2006.
[147] R. Koschke, “Frontiers of Software Clone Management”, IEEE Frontiers of
Software Maintenance at ICSM (FoSM), 2008. [148] R. Koschke, “Survey of Research on Software Clones”, Dagstuhl Seminar, 2006.
(“Software Clone Detection Survey”, Presentation) [149] R. Koschke, A. Lakhotia, E. Merlo, and A. Walenstein, “Duplication, Redundancy,
and Similarity in Software”, Dagstuhl Seminar, 2006. http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=2006301
[150] J. Kramer, “Is Abstraction the Key to Computing?”, Communications of the ACM,
2007. [151] J. Krinke, “A Study of Consistent and Inconsistent Changes to Code Clones”, IEEE
Working Conference on Reverse Engineering (WCRE), 2007. [152] J. Krinke, “Identifying Similar Code with Program Dependence Graphs”, IEEE
Working Conference on Reverse Engineering (WCRE), 2001. [153] J.B. Kruskal, “An Overview of Sequence Comparison: Time Warps, String Edits,
and Macromolecules”, SIAM Review, 1983. [154] V. Kruskal, “Managing Multi-Version Programs with an Editor”, IBM Journal of
Research and Development, 1984. [155] A. Kuhn and S. Ducasse, “Enriching Reverse Engineering with Semantic
Clustering”, IEEE Working Conference on Reverse Engineering (WCRE), 2005. [156] A. Kuhn, S. Ducasse, and T. Girba, “Semantic Clustering: Identifying Topics in
Source Code”, Information and Software Technology, Elsevier, 2006. [157] T. Kuhn and O. Thomann, “Abstract Syntax Tree”, Eclipse Corner Article, 2006.
http://www.eclipse.org/articles/article.php?file=Article-JavaCodeManipulation_AST/index.html
[158] B. Lague, D. Proulx, J. Mayrand, E.M. Merlo, and J. Hudepohl, “Assessing the
Benefits of Incorporating Function Clone Detection in a Development Process”, IEEE International Conference on Software Maintenance (ICSM), 1997.
[159] S. Lammers, “Charles Simonyi- 1986”, Programmers At Work: PAW 1986
Interviews, 1986. http://programmersatwork.wordpress.com/programmers-at-work-charles-simonyi/
[160] B.M. Lange and T.G. Moher, “Some Strategies of Reuse in an Object-Oriented
Programming Environment”, ACM SIGCHI Conference on Human Factors in
Computing Systems (CHI), 1989.
[161] J. Laski and W. Szermer, “Identification of Program Modifications and its Applications in Software Maintenance”, IEEE International Conference on
Software Maintenance (ICSM), 1992. [162] T. LaToza, “A Literature Review of Clone Detection Analysis”, Carnegie Mellon
University, 2005. [163] D. Lawrie, H. Feild, and D. Binkley, “An Empirical Study of Rules for Well-
Formed Identifiers”, Journal of Software Maintenance and Evolution: Research
and Practice (JSME), John Wiley & Sons, 2007. [164] D. Lawrie, H. Feild, and D. Binkley, “Extracting Meaning from Abbreviated
Identifiers”, IEEE International Working Conference on Source Code Analysis and
Manipulation (SCAM), 2007. [165] D. Lawrie, H. Feild, and D. Binkley, “Quantifying Identifier Quality: An Analysis
of Trends”, Empirical Software Engineering, Springer, 2007. [166] D. Lawrie, H. Feild, and D. Binkley, “Syntactic Identifier Conciseness and
Consistency”, IEEE International Workshop on Source Code Analysis and
Manipulation (SCAM), 2006. [167] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “Effective Identifier Names for
Comprehension and Memory”, Innovations in Systems and Software Engineering, Springer, 2007.
[168] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “What's in a Name? A Study of
Identifiers”, IEEE International Conference on Program Comprehension (ICPC), 2006.
[169] N.G. Leveson, “Intent Specifications: An Approach to Building Human-Centered
Specifications”, IEEE Transactions on Software Engineering (TSE), 2000. [170] C. Lewis, “Some Learnability Results for Analogical Generalization”, University of
Colorado at Boulder, 1988. [171] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, “CP-Miner: A Tool for Finding Copy-paste
and Related Bugs in Operating System Code”, USENIX-ACM SIGOPS Symposium
on Operating Systems Design and Implementation (OSDI), 2004. [172] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, “CP-Miner: Finding Copy-Paste and
Related Bugs in Large-Scale Software Code”, IEEE Transactions on Software
Engineering (TSE), 2006.
[173] Z. Li and Y. Zhou, “PR-Miner: Automatically Extracting Implicit Programming Rules and Detecting Violations in Large Software Code”, ACM SIGSOFT
International Symposium on the Foundations of Software Engineering (FSE), 2005. [174] B. Liblit, A. Aiken, A.X. Zheng, and M.I. Jordan, “Bug Isolation via Remote
Program Sampling”, ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI), 2003. [175] B. Liskov and J. Guttag, Program Development in Java: Abstraction, Specification,
and Object-Oriented Design, Addison-Wesley, 2001. [176] C. Liu, C. Chen, J. Han, and P. Yu, “GPLAG: Detection of Software Plagiarism by
Program Dependence Graph Analysis”, ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD), 2006. [177] A. Lozano, “A Methodology to Assess the Impact of Source Code Flaws in
Changeability, and its Application to Clones”, IEEE International Conference on
Software Maintenance (ICSM), 2008. [178] A. Lozano and M. Wermelinger, “Assessing the Effect of Clones on
Changeability”, IEEE International Conference on Software Maintenance (ICSM), 2008.
[179] A. Lozano, M. Wermelinger, and B. Nuseibeh, “Assessing the Impact of Bad
Smells using Historical Information”, ACM SIGSOFT-IEEE International
Workshop on Principles of Software Evolution (IWPSE), 2007. [180] A. Lozano, M. Wermelinger, and B. Nuseibeh, “Evaluating the Harmfulness of
Cloning: A Change Based Experiment”, ACM SIGSOFT-IEEE International
Workshop on Mining Software Repositories (MSR), 2007. [181] D. Mandelin, L. Xu, R. Bodik, and D. Kimelman, “Jungloid Mining: Helping to
Navigate the API Jungle”, ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI), 2005. [182] Z.A. Mann, “Three Public Enemies: Cut, Copy, and Paste”, IEEE Computer
Magazine, 2006. [183] R. McCauley, S. Fitzgerald, G. Lewandowski, L. Murphy, B. Simon, L. Thomas
and C. Zander, “Debugging: A Review of the Literature from an Educational Perspective”, Computer Science Education, Taylor & Francis, 2008.
[184] T. Mende, F. Beckwermert, R. Koschke, and G. Meier, “Supporting the Grow-and-
Prune Model in Software Product Lines Evolution Using Clone Detection”, IEEE
European Conference on Software Maintenance and Reengineering (CSMR), 2008.
[185] T. Mende, R. Koschke, and F. Beckwermert, “An Evaluation of Code Similarity Identification for the Grow-and-Prune Model”, Journal of Software Maintenance
and Evolution: Research and Practice (JSME), John Wiley & Sons, 2009. [186] Y.M. Mileva, “Learning from Deletions”, ACM SIGSOFT International
Symposium on the Foundations of Software Engineering (FSE), 2008. [187] Y.M. Mileva and A. Zeller, “Project-Specific Deletion Patterns”, International
Workshop on Recommendation Systems for Software Engineering (RSSE), 2008. [188] P. Miller, J. Pane, G. Meter, and S. Vorthmann, “Evolution of Novice
Programming Environments: The Structure Editors of Carnegie Mellon University”, Interactive Learning Environments (ILE), 1994.
[189] R.C. Miller and B.A. Myers, “Interactive Simultaneous Editing of Multiple Text
Regions”, USENIX Annual Technical Conference, 2001. [190] F. Mitter, “Tracking Source Code Propagation in Software Systems via Release
History Data and Code Clone Detection”, Vienna University of Technology, Diploma Thesis, 2006.
[191] A. Monden, D. Nakae, T. Kamiya, S. Sato, and K. Matsumoto, “Software Quality
Analysis by Code Clones in Industrial Legacy Software”, IEEE International
Symposium on Software Metrics (METRICS), 2002. [192] H.L. Morgan, “Spelling Correction in Systems Programs”, Communications of the
ACM, 1970. [193] H.A. Nguyen, T.T. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen,
“Accurate and Efficient Structural Characteristic Feature Extraction for Clone Detection”, European Joint Conferences on Theory and Practice of Software
(ETAPS) International Conference on Fundamental Approaches to Software
Engineering (FASE), 2009. [194] T.T. Nguyen, H.A. Nguyen, J.M. Al-Kofahi, N.H. Pham, and T. Nguyen, “Scalable
and Incremental Clone Detection for Evolving Software”, IEEE International
Conference on Software Maintenance (ICSM), 2009. [195] T.T. Nguyen, H.A. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen,
“Cleman: Comprehensive Clone Group Evolution Management”, ACM SIGSOFT-
SIGART-IEEE International Conference on Automated Software Engineering
(ASE), 2008. [196] T.T. Nguyen, H.A. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen,
“ClemanX: Incremental Clone Detection Tool for Evolving Software”, ACM
SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 2009.
[197] T.T. Nguyen, H.A. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen, “Clone-aware Configuration Management”, ACM SIGSOFT-SIGART-IEEE
International Conference on Automated Software Engineering (ASE), 2009. [198] S. Niezgoda and T.P. Way, “SNITCH: A Software Tool for Detecting Cut and
Paste Plagiarism”, ACM SIGCSE Technical Symposium on Computer Science
Education, 2006. [199] D. Notkin, “The GANDALF Project”, The Journal of Systems and Software,
Elsevier, 1985. [200] D. Notkin, N. Habermann, R. Ellison, G. Kaiser, and D. Garlan, “Letter to the
Editor”, ACM SIGPLAN Notices, 1983. [201] J. Rasmussen, “Models of Mental Strategies in Process Plant Diagnosis”, Human
Detection and Diagnosis of System Failures, Plenum Press, 1981. [202] J. Reason, Human Error, Cambridge University Press, 1990. [203] S.P. Reiss, “Pecan: Program Development Systems that Support Multiple Views”,
ACM SIGSOFT-IEEE International Conference on Software Engineering (ICSE), 1984.
[204] S.P. Reiss, “PECAN: Program Development Systems that Support Multiple
Views”, IEEE Transactions on Software Engineering (TSE), 1985. [205] S.P. Reiss, “Tracking Source Locations”, ACM SIGSOFT-IEEE International
Conference on Software Engineering (ICSE), 2008. [206] C. Rich and H.E. Shrobe, “Initial Report on a Lisp Programmer’s Apprentice”,
IEEE Transactions on Software Engineering (TSE), 1978. [207] M. Rieger, “Effective Clone Detection Without Language Barriers”, University of
Bern, PhD Dissertation, 2005. [208] M. Rieger, S. Ducasse, and M. Lanza, “Insights into System-Wide Code
Duplication”, IEEE Working Conference on Reverse Engineering (WCRE), 2004. [209] E.L. Rissland, “Examples and Learning Systems”, Adaptive Control of Ill-Defined
Systems, Plenum Press, 1984. [210] M.B. Rosson and J.M. Carroll, “The Reuse of Uses in Smalltalk Programming”,
ACM Transactions on Computer-Human Interaction (TOCHI), 1996. [211] C.K. Roy and J.R. Cordy, “A Survey on Software Clone Detection Research”,
Queen’s University, Technical Report 2007-541, 2007.
[212] C.K. Roy and J.R. Cordy, “NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization”, IEEE
International Conference on Program Comprehension (ICPC), 2008. [213] C.K. Roy and J.R. Cordy, “Scenario-Based Comparison of Clone Detection
Techniques”, IEEE International Conference on Program Comprehension (ICPC), 2008.
[214] C.K. Roy, J.R. Cordy, and R. Koschke, “Comparison and Evaluation of Code
Clone Detection Techniques and Tools: A Qualitative Approach”, Science of
Computer Programming, Elsevier, 2009. [215] T. Sager, “Coogle: A Code Google Eclipse Plug-in for Detecting Similar Java
Classes”, University of Zurich, Diploma Thesis, 2006. [216] T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer, “Detecting Similar Java Classes
Using Tree Algorithms”, ACM SIGSOFT-IEEE International Workshop on Mining
Software Repositories (MSR), 2006. [217] N. Sahavechaphan and K. Claypool, “XSnippet: Mining For Sample Code”, ACM
SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and
Applications (OOPSLA), 2006. [218] S. Schleimer, D.S. Wilkerson, and A. Aiken, “Winnowing: Local Algorithms for
Document Fingerprinting”, ACM SIGMOD International Conference on
Management of Data (SIGMOD) and Principles of Database Systems (PODS), 2003.
[219] M.M. Schrage, “Proxima - A Presentation-Oriented Editor for Structured
Documents”, Utrecht University, PhD Dissertation, 2004. [220] C. Simonyi, “Program Identifier Naming Conventions (Hungarian Notation)”,
Microsoft Corporation, 1999. http://msdn2.microsoft.com/en-us/library/aa260976%28VS.60%29.aspx
[221] C. Simonyi, “Programmers at Work - Follow Up”, Intentional Software
Corporation, 2008. http://blog.intentsoft.com/intentional_software/2008/06/programmers-at.html
[222] C. Simonyi, M. Christerson, and S. Clifford, “Intentional Software”, ACM
SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and
Applications (OOPSLA), 2006. [223] R. Smith and S. Horwitz, “Detecting and Measuring Similarity in Code Clones”,
International Workshop on Software Clones (IWSC), 2009.
[224] M. Sojer and J. Henkel, “Code Reuse in Open Source Software Development: Quantitative Evidence, Drivers, and Impediments”, University of Technology,
Munich, Working Paper, 2009. [225] M. Sojer and J. Henkel, “Reuse in Open Source Projects Survey: Descriptive
Results”, University of Technology, Munich, Research Project, 2009. [226] A. Solovey, “A Few Words In Defense of Copy And Paste Programming”,
Software Creation Mystery, 2008. http://www.softwarecreation.org/2008/a-few-words-in-defense-of-copy-and-paste-programming/
[227] R. Tairas, “Clone Maintenance through Analysis and Refactoring”, ACM SIGSOFT
International Symposium on the Foundations of Software Engineering (FSE), 2008. [228] R. Tairas, J. Gray, and I. Baxter, “Visualization of Clone Detection Results”,
Eclipse Technology Exchange Workshop at OOPSLA (ETX), 2006. [229] T. Teitelbaum and T. Reps, “The Cornell Program Synthesizer: A Syntax-Directed
Programming Environment”, Communications of the ACM, 1981. [230] R.M.H. Ting and J. Bailey, “Mining Minimal Contrast Subgraph Patterns”, SIAM
International Conference on Data Mining (SDM), 2006. [231] M. Toomim, A. Begel, and S.L. Graham, “Managing Duplicated Code with Linked
Editing”, IEEE Symposium on Visual Languages and Human-Centric Computing
(VLHCC), 2004. [232] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, “Gemini: Maintenance Support
Environment Based on Code Clone Analysis”, IEEE International Symposium on
Software Metrics (METRICS), 2002. [233] J.R. Ullmann, “An Algorithm for Subgraph Isomorphism”, Journal of the ACM
(JACM), 1976. [234] Various Authors, “Clone And Modify Programming”, Cunningham &
Cunningham, Inc. Wiki, 2006-Present. http://www.c2.com/cgi/wiki?CloneAndModifyProgramming
[235] Various Authors, “Code Smell”, Cunningham & Cunningham, Inc. Wiki, 2006-
Present. http://www.c2.com/cgi/wiki?CodeSmell [236] Various Authors, “Copy And Paste”, Cunningham & Cunningham, Inc. Wiki, 2006-
Present. http://www.c2.com/cgi/wiki?CopyAndPaste
[237] Various Authors, “Copy And Paste Programming”, Cunningham & Cunningham,
Inc. Wiki, 2006-Present. http://www.c2.com/cgi/wiki?CopyAndPasteProgramming
[238] Various Authors, “Copy and Paste Programming”, NationMaster Encyclopedia,
2003-Present. http://www.nationmaster.com/encyclopedia/Copy-and-paste-programming
[239] Various Authors, “Copy and Paste Programming”, Wikipedia Encyclopedia, 2003-Present. http://en.wikipedia.org/wiki/Copy_and_paste_programming
[240] Various Authors, “Cut, Copy, and Paste”, Wikipedia Encyclopedia, 2002-Present.
http://en.wikipedia.org/wiki/Copy_and_paste [241] Various Authors, “Diff”, Wikipedia Encyclopedia, 2002-Present.
http://en.wikipedia.org/wiki/Diff [242] Various Authors, “Don’t Repeat Yourself (DRY)”, Cunningham & Cunningham,
Inc. Wiki, 2006-Present. http://www.c2.com/cgi/wiki?DontRepeatYourself [243] Various Authors, “Duplicated Code”, Cunningham & Cunningham, Inc. Wiki,
2006-Present. http://www.c2.com/cgi/wiki?DuplicatedCode [244] Various Authors, “Once And Only Once”, Cunningham & Cunningham, Inc. Wiki,
2006-Present. http://www.c2.com/cgi/wiki?OnceAndOnlyOnce [245] Various Authors, “Rogue Tile”, Cunningham & Cunningham, Inc. Wiki, 2006-
Present. http://www.c2.com/cgi/wiki?RogueTile [246] Various Authors, “Rule of Three (Programming)”, Wikipedia Encyclopedia, 2008-
Present. http://en.wikipedia.org/wiki/Rule_of_three_(programming) [247] Various Authors, “Three Strikes And You Refactor”, Cunningham & Cunningham,
Inc. Wiki, 2006-Present. http://www.c2.com/cgi/wiki?ThreeStrikesAndYouRefactor
[248] I. Vessey, “Expertise in Debugging Computer Programs: A Process Analysis”,
International Journal of Man-Machine Studies (IJMMS), Academic Press, 1985. [249] T.A. Wagner and S.L. Graham, “Efficient and Flexible Incremental Parsing”, ACM
Transactions on Programming Languages and Systems (TOPLAS), 1998. [250] R.C. Waters, “Program Editors Should Not Abandon Text Oriented Commands”,
ACM SIGPLAN Notices, 1982.
[251] V. Weckerle, “CPC: An Eclipse Framework for Automated Clone Life Cycle Tracking and Update Anomaly Detection”, Free University of Berlin, Master’s Thesis, 2008.
[252] T. Widmer, “Unleashing the Power of Refactoring”, Eclipse Corner Article, 2007.
http://www.eclipse.org/articles/article.php?file=Article-Unleashing-the-Power-of-Refactoring/index.html
[253] P.H. Winston, “Learning and Reasoning by Analogy”, Communications of the
ACM, 1980. [254] C. Wohlin, P. Runeson, M. Hoest, M.C. Ohlsson, B. Regnell, and A. Wesslen,
Experimentation in Software Engineering: An Introduction, Kluwer Academic
Publishers, 2000. [255] Y. Xie and D. Engler, “Using Redundancies to Find Errors”, IEEE Transactions on
Software Engineering (TSE), 2003. [256] Z. Xing and E. Stroulia, “Analyzing the Evolutionary History of the Logical Design
of Object-Oriented Software”, IEEE Transactions on Software Engineering (TSE), 2005.
[257] Z. Xing and E. Stroulia, “Differencing Logical UML Models”, Automated Software
Engineering, Kluwer Academic Publishers, 2007. [258] Z. Xing and E. Stroulia, “UMLDiff: An Algorithm for Object-Oriented Design
Differencing”, ACM SIGSOFT-SIGART-IEEE International Conference on
Automated Software Engineering (ASE), 2005. [259] W. Yang, “Identifying Syntactic Differences Between Two Programs”, Software -
Practice & Experience, John Wiley & Sons, 1991. [260] G. Yarmish and D. Kopec, “Revisiting Novice Programmer Errors”, ACM SIGCSE
Bulletin, 2007. [261] F.K. Zadeck, “Incremental Data Flow Analysis in a Structured Program Editor”,
ACM SIGPLAN Symposium on Compiler Construction (SCC), 1984. [262] T. Zimmermann, P. Weissgerber, S. Diehl, and A. Zeller, “Mining Version
Histories to Guide Software Changes”, ACM SIGSOFT-IEEE International
Conference on Software Engineering (ICSE), 2004.
Appendix A
IRB Recruitment Letter
Email to Potential Subjects for Recruitment
Dear Colleagues,
Do you often copy and paste code during programming? Do you want to learn how this process can be better supported with software development tools and IDEs? We have developed a software tool that helps programmers manage their copying and pasting activity. We are looking for programmers to participate in a voluntary user study of the tool features.
The study will take approximately 1.5 hours. It will consist of us giving you an overview of the tool and its purpose (in less than 30 minutes) followed by four pairs of (that is eight) small programming tasks for you to complete within 1 hour. There will be a time limit associated with each task. For the first four tasks, you will be given about 10 minutes each to work on and about 5 minutes for each of the four other tasks. The tasks are in the domain of GUI programming and should be easy to do for anyone with a moderate amount of experience. The goal of this study is to test the software features we developed, not anything about you.
At the end of your participation, you will receive $10 for having actively completed the study from the beginning to the end. You may choose to quit from the study anytime you want. If you quit, you will receive NO compensation. At the end of this whole study (that is, after we conduct all the sessions with all participants), we will select FOUR participants who have conducted the programming tasks with the best accuracy and speed and award each of them another $20. (In case more than four individuals qualify for awards, we will select the four who participate in our study the earliest.) Our selection of the four awardees will be totally based on our best judgment and observations. Our decision will be final.
In order to participate, you must be able to read and write simple Java programs and have experience with Swing (graphical user interface programming). Familiarity with integrated development environments (IDEs), especially Eclipse, is preferred, but not necessary.
If you are interested in participating in this study (or have any questions), please reply by email to both Dr. Daqing Hou (dhou@clarkson.edu) and Patricia Deshane (deshanpa@clarkson.edu) with a short description of your experience with Java, Swing, and Eclipse. If you qualify for participation, we will contact you to schedule a time to conduct the study as well as ask for a few more pieces of information about you (full name, age, gender, major, year of study, and contact info).
Finally, thank you for considering participation. Your effort will make a difference in our research and is greatly appreciated by both of us.
Sincerely,
Patricia Deshane (Jablonski), PhD candidate of Engineering Science
Daqing Hou, Professor of Electrical and Computer Engineering
Appendix B
IRB Consent Form
Clarkson University
Documentation of Informed Consent to Participate in Research
Project Title: Managing the Copy-and-Paste Programming Practice
Researcher(s): Patricia Deshane (Jablonski), PhD Candidate; Daqing Hou, Advisor
Institutional Review Board (IRB) approval number: 09-27
Approval valid until: Fall 2009
You have been asked to be a part of the research described here. Participation is voluntary.
The purpose of this study: We would like to evaluate our software tool with users.
What to expect: You will be given an overview of our software (in less than 30 minutes), followed by a set of eight programming tasks to complete (within 1 hour). We will capture your activity on the computer with screen-capturing software and record your voice. We ask that you talk aloud about what you are thinking of, while completing the task. We will ask you questions related to your experience after the tasks are completed. The entire study will take about 1.5 hours. If you have questions about this research, you may contact Daqing Hou (dhou@clarkson.edu) or Patricia Deshane (deshanpa@clarkson.edu).
Risks and discomforts to you if you take part in this study: There are no more risks and discomforts to you than the normal student activities. You may quit the study at any time.
The benefits to you if you take part in this study: You will become the first user of our software tool and learn about the copy-and-paste programming practice, which may be beneficial to you in your own programming tasks.
What will you receive for taking part in this study: You will be compensated $10 for participating in this study. Four top participants who have completed the task with the best accuracy and efficiency will be selected and awarded $20 each. You may choose to quit from the study whenever you would like to. If you quit, you will not receive the compensation. Therefore, you may receive $0 minimally and $30 maximally from this study.
What will happen to the information collected in this study: The information collected will be kept confidential as much as is permitted by law. We will not use personally identifiable information (like names, etc.) in our recorded data or report of the study’s results.
What rights do you have when you take part in this study: Participation in this research is voluntary. Deciding not to take part or to stop being a part of this research will result in no penalty, fine or loss of benefits which you otherwise have a right to. If you have questions about your rights as a research subject or if you wish to report any harm, injury, risk or other concern, please contact Dr. Leslie Russek, Chair of the Clarkson University Institutional Review Board (IRB) for human subjects research: (315)268-3761 or Lnrussek@clarkson.edu.
Conflict of Interest: The researchers have no financial interest in performing this study.
Informed Consent: Please sign here to show you have had the purpose of this research explained and you have been informed of what to expect and your rights. You should have all your questions answered to your satisfaction. Your signature shows that you agree to take part in this research. By signing below you also attest that you are at least 18 years old. You will be given a copy of this consent form to keep for your records.
Signature of volunteer: ______________________________ Date: ______________
Signature of researcher obtaining informed consent: ______________________________ Date: ______________
Appendix C
IRB Questionnaire
User Experience Questionnaire
Subject # _______
1) Describe your experience with Java and Swing programming (how many months, how
many lines of code written (best estimation), in what kinds of projects, for course work or industry, and so on).
2) Describe your experience with using the Eclipse integrated development environment
before participating in this study (for how long, in what kinds of projects, and so on). Which other IDEs are you familiar with? (Netbeans, Visual Studio, etc.)
3) Describe your own practice of copy-and-paste (never; seldom; often. Anecdotes about its
advantages or disadvantages are particularly welcome). Did you know of or use clone detection tools before this study?
4) What is your experience with CCFinder? How helpful is it in the debugging task?
5) How do you normally debug Java source code? What tools do you use when debugging?
6) Describe your own experience with renaming before this study. What tools do you normally use when renaming? (Manually, Find & Replace, Rename Refactoring, etc.)
7) What were the most frustrating parts about completing these tasks (with and without our tool)?
8) What did you like about any of our tool’s features (clone tracking, CReN, LexId)?
9) What did you dislike about any of our tool’s features (clone tracking, CReN, LexId)?
10) Would you use any of these features while you write software? Which ones?
VITA
Patricia Deshane (maiden name: Jablonski) is originally from Amsterdam, New
York. She was a student at Clarkson University in Potsdam, New York from August 1998
until May 2010. She graduated with a Bachelor of Science in Mathematics in 2002. She
then earned two Master’s degrees: a Master of Business Administration in 2003 and a Master
of Science in Information Technology in 2004. Finally, she completed her Ph.D. in
Engineering Science in 2010. Her research interests in the area of software engineering
include software maintenance, software quality, and source code management.