Recognizing Words from Source Code Identifiers using Speech Recognition Techniques
CSMR 2010, Madrid
Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol
Content
• Problem Statement
• Aligning Strings and Words
• Meta-heuristic Inspired Approach
• Technologies
• Case Study – Research Questions
• Case Study – Results
• Conclusion and Future Work
The Challenge
• A few years after deployment, documentation may no longer exist.
• If it exists, it will almost surely be outdated.
• My customers want to change the system, add new functionality, or fix a defect.
• The only available source of information is the code:
  - Identifiers;
  - Comments.
Identifier Semantics
• Researchers agree that identifier semantics are important:
  - Help program comprehension;
  - Suggest clues.
• Composed identifiers:
  - Camel Case: MyLocalAccount, User_Address
  - Contraction based: pntrctr, usrAdrss, imagEdge
  - Good and possibly known to the developers: hmmm, ixoth, pqrstuvwxyz
Words, Terms, Soft, and Hard Words
• Term: any substring in a compound identifier.
• Word: an entry in a dictionary (e.g., the English dictionary).
• Hard words: terms composing an identifier that reflect domain concepts and are clearly demarcated:
  - baseAddress, user_file
• Soft words: terms different from the identifier and not clearly demarcated (e.g., abbreviations, contractions, etc.):
  - userarea, ptrcntr, userGid
Current Practices
• Camel Case-based approaches plus greedy algorithms, e.g., Lawrie et al. 2006, 2007.
• Samurai by Enslen et al., 2009:
  - Lexicon plus a greedy algorithm;
  - If a contraction is used somewhere in the code, then it is likely used in the same context as the original term;
  - Frequency tables of contractions and terms to split composed identifiers.
• Limitations: abbreviations are not treated, and there is no quantification of how close the match is to the unknown string.
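
For reference, the Camel Case splitting that these practices build on can be sketched in a few lines. This is only a minimal illustration, not the implementation of Lawrie et al. or of Samurai; the function name `split_camel_case` and the regular expression are assumptions introduced here.

```python
import re

# Hypothetical token pattern (an assumption, not taken from the slides):
#   - a run of capitals not followed by a lowercase letter (acronyms),
#   - an optional capital followed by lowercase letters (ordinary words),
#   - a run of digits.
# Underscores and other separators match nothing and are skipped.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_camel_case(identifier):
    """Split an identifier at case changes and at underscores."""
    return _TOKEN.findall(identifier)


print(split_camel_case("MyLocalAccount"))  # ['My', 'Local', 'Account']
print(split_camel_case("User_Address"))    # ['User', 'Address']
print(split_camel_case("pntrctr"))         # ['pntrctr']
```

The last call shows the limitation: a contraction-based identifier with no case change or separator comes back as a single, unexpanded token.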
Our Approach in Essence
• Developers compose identifiers:
  - Using terms and words reflecting domain concepts and the developer’s experience and knowledge.
• Developers generate contractions via a finite set of transformation rules:
  - Drop all vowels, drop prefix, drop suffix, etc.
• The approach mimics the developers’ identifier generation process:
  - Dictionaries capturing terms and words;
  - A search-based technique to split exactly any unknown string;
  - A distance using Dynamic Time Warping (DTW) for continuous speech recognition [H. Ney, 1984].
Modified H. Ney DTW
[Figure: DTW distance matrices aligning the identifier against a dictionary of 3 words (pntr, ctr, usr). Horizontal axis: the characters of the identifier, p n t r c t r u s r; vertical axis: the characters of each dictionary word.]
Identifier to split: pntrctrusr
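
The alignment pictured on this slide can be approximated with a small dynamic program. The sketch below is not the modified Ney algorithm itself. It assumes plain Levenshtein distance as the cost of matching one dictionary word against one segment of the identifier, and it picks the sequence of words with the smallest total cost. The names `prefix_distances` and `split_identifier` are hypothetical.

```python
def prefix_distances(word, text):
    """Return d where d[k] is the Levenshtein distance of word vs text[:k]."""
    prev = list(range(len(text) + 1))          # empty word against each prefix
    for i, wc in enumerate(word, start=1):
        cur = [i]                              # word[:i] against the empty prefix
        for k, tc in enumerate(text, start=1):
            cur.append(min(prev[k] + 1,                 # drop wc from the word
                           cur[k - 1] + 1,              # extra character tc in the text
                           prev[k - 1] + (wc != tc)))   # match or substitute
        prev = cur
    return prev


def split_identifier(identifier, dictionary):
    """Best split of identifier into dictionary words: (distance, words)."""
    if not dictionary:
        raise ValueError("dictionary must not be empty")
    n = len(identifier)
    best = [0] + [float("inf")] * n   # best[e]: cheapest cover of identifier[:e]
    back = [None] * (n + 1)           # back[e]: (start, word) of the last segment
    for start in range(n):
        for word in dictionary:
            dist = prefix_distances(word, identifier[start:])
            for k in range(1, len(dist)):       # segments are never empty
                total = best[start] + dist[k]
                if total < best[start + k]:
                    best[start + k] = total
                    back[start + k] = (start, word)
    words, end = [], n
    while end > 0:                    # walk the back-pointers from the end
        end, word = back[end]
        words.append(word)
    return best[n], words[::-1]


print(split_identifier("pntrctrusr", ["pntr", "ctr", "usr"]))
# (0, ['pntr', 'ctr', 'usr'])
```

A distance of zero means the identifier is an exact concatenation of dictionary entries. A non-zero distance quantifies how far the closest concatenation is from the identifier.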
Word Transformation Rules
• Drop all vowels: pointer → pntr
• Drop a random vowel: user → usr
• Drop a random character: pntr → ptr
• Drop suffix (ing, tion, ed, ment, able): available → avail
• Drop the last m characters: rectangle → rect
Constraint: the string must remain at least 3 characters long.
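
The rules on this slide can be written as small string functions. This is a sketch under stated assumptions: input words are lowercase, the vowels are a, e, i, o, u, and a rule that would leave fewer than 3 characters returns the word unchanged (the slide states the constraint but not how a violation is handled). All function names are hypothetical.

```python
import random

VOWELS = "aeiou"
SUFFIXES = ("ing", "tion", "ed", "ment", "able")
MIN_LENGTH = 3  # constraint from the slide: at least 3 characters must remain


def _keep_if_valid(original, contracted):
    """Assumed policy: reject a contraction that becomes too short."""
    return contracted if len(contracted) >= MIN_LENGTH else original


def drop_all_vowels(word):
    return _keep_if_valid(word, "".join(c for c in word if c not in VOWELS))


def drop_random_vowel(word, rng):
    positions = [i for i, c in enumerate(word) if c in VOWELS]
    if not positions:
        return word
    i = rng.choice(positions)
    return _keep_if_valid(word, word[:i] + word[i + 1:])


def drop_random_character(word, rng):
    if not word:
        return word
    i = rng.randrange(len(word))
    return _keep_if_valid(word, word[:i] + word[i + 1:])


def drop_suffix(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return _keep_if_valid(word, word[:-len(suffix)])
    return word


def drop_last(word, m):
    if m <= 0:
        return word
    return _keep_if_valid(word, word[:-m])


rng = random.Random(0)
print(drop_all_vowels("pointer"))      # pntr
print(drop_suffix("available"))        # avail
print(drop_last("rectangle", 5))       # rect
print(drop_random_vowel("user", rng))  # either 'ser' or 'usr'
```

Passing the random generator explicitly keeps the two random rules reproducible when a seed is fixed.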
Overall Splitting (Hill Climbing) Procedure
[Flowchart. The identifier is matched against the current dictionary with DTW. If the best match has zero distance, the split succeeds. Otherwise, a word with minimal non-zero distance is selected at random, a random transformation is applied to it, and the transformed word is added to a temporary dictionary. DTW matching is repeated: if the distance is reduced, the transformed word is kept; if not, another transformation is tried when one remains to apply, and otherwise the word is discarded from the temporary dictionary.]
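
The loop in this flowchart can be sketched as a simple hill climber. The code is an illustration under several assumptions, not the authors' implementation: the distance is a Levenshtein-based stand-in for the DTW match, the word to transform is picked uniformly from the working dictionary rather than by minimal non-zero distance, and only three contraction rules are used. The names `split_distance`, `contract`, and `hill_climb_split` are hypothetical, and dictionary words are assumed to be non-empty lowercase strings.

```python
import random


def split_distance(identifier, words):
    """Smallest total edit distance between identifier and any
    concatenation of entries from words (stand-in for the DTW match)."""
    n = len(identifier)
    best = [0] + [float("inf")] * n
    for s in range(n):
        for w in words:
            row = list(range(n - s + 1))      # empty word vs identifier[s:s+k]
            for i, wc in enumerate(w, start=1):
                new = [i]
                for k in range(1, n - s + 1):
                    tc = identifier[s + k - 1]
                    new.append(min(row[k] + 1, new[k - 1] + 1,
                                   row[k - 1] + (wc != tc)))
                row = new
            for k in range(1, n - s + 1):
                best[s + k] = min(best[s + k], best[s] + row[k])
    return best[n]


def contract(word, rng):
    """One random contraction, or None if fewer than 3 characters would remain."""
    if len(word) <= 3:
        return None
    rule = rng.randrange(3)
    if rule == 0:                              # drop all vowels
        out = "".join(c for c in word if c not in "aeiou")
    elif rule == 1:                            # drop one random character
        i = rng.randrange(len(word))
        out = word[:i] + word[i + 1:]
    else:                                      # drop the last character
        out = word[:-1]
    return out if len(out) >= 3 else None


def hill_climb_split(identifier, dictionary, max_iterations=10000, seed=0):
    rng = random.Random(seed)
    words = list(dictionary)                   # working (temporary) dictionary
    best = split_distance(identifier, words)
    for _ in range(max_iterations):
        if best == 0:                          # exact match: success
            break
        candidate = contract(rng.choice(words), rng)
        if candidate is None or candidate in words:
            continue
        trial = split_distance(identifier, words + [candidate])
        if trial < best:                       # keep only improving words
            words.append(candidate)
            best = trial
    return best, words


print(hill_climb_split("pntrusr", ["pointer", "user"], max_iterations=300))
```

Because kept contractions stay in the working dictionary, later iterations can contract them further, for example pointer to pntr and then pntr to ptr. The result depends on the seed and the iteration budget, so the sketch returns the remaining distance together with the dictionary it reached.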
Case Study - Research Questions
• RQ1: What is the percentage of identifiers correctly split by the proposed approach?
• RQ2: How does the proposed approach perform compared with the Camel Case splitter?
• RQ3: What percentage of identifiers containing word abbreviations is the approach able to map to dictionary words?
Case Study - Results
• JHotDraw – Java
  - 16 KLOC
  - 155 files
  - 2,348 identifiers (longer than 2 chars)
  - 957 manually segmented identifiers
• Lynx – C
  - 174 KLOC
  - 247 files
  - 12,194 identifiers (longer than 2 chars)
  - 3,085 manually segmented identifiers
RQ1 - Percentage of Correct Classifications
Systems    Ids     Splits (single iteration)   Splits (multiple iterations)   Errors
JHotDraw   957     891 (93%)                   920 (95%)                      37
Lynx       3,085   2,169 (70%)                 2,901 (94%)                    271
Typical cases where the approach failed:
afaik, ihmo, foobar, fsize …
RQ2 - Camel Case Split
Systems    Ids     Correct splits   Errors
JHotDraw   957     874 (91%)        83
Lynx       3,085   561 (18%)        2,524
Statistical comparison (Fisher’s exact test) with our approach:
Null Hypothesis (H0): The proportions of correct splittings obtained by the two approaches are not significantly different.
• JHotDraw: Odds Ratio = 1.3, p-value = 0.1
• Lynx: Odds Ratio = 60, p-value < 0.001
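
As a worked illustration of what an odds ratio expresses, the sketch below computes the simple cross-product ratio of a 2×2 table of correct and incorrect splits. The slide does not say which counts were fed to the test. The example assumes the JHotDraw single-iteration figure from RQ1 (891 correct out of 957) against the Camel Case figure (874 correct out of 957), and the function name `odds_ratio` is hypothetical. Fisher's exact test reports its own conditional estimate, which can differ slightly from this ratio.

```python
def odds_ratio(correct_a, errors_a, correct_b, errors_b):
    """Cross-product odds ratio of approach A versus approach B."""
    return (correct_a * errors_b) / (errors_a * correct_b)


# Assumed inputs: 891 correct / 66 errors versus 874 correct / 83 errors.
print(round(odds_ratio(891, 957 - 891, 874, 957 - 874), 1))  # 1.3
```

A ratio above 1 means the odds of a correct split are higher for the first approach. The p-value then says whether that difference is statistically significant.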
RQ3 - Percentage of Correctly Split Ids
Systems    Ids     Correct splits   Errors
JHotDraw   957     920 (95%)        37
Lynx       3,085   2,901 (94%)      271
The novel identifier splitting approach performs better than the Camel Case splitter.
Multiple Possible Splits - Successes
Identifier      Possible splits
borddec         bord decimal             bord decision
anchorlen       anchor length            anchor lender
drawrect        draw rectangle
drawroundrect   draw round rectangle
fillrect        fill rectangle
javadrawapp     java draw apply          java draw append
netapp          net apply                net append
newlen          new length               new lender
nothingapp      nothing apply            nothing application
addcolumninfo   add column information   add column inform
addlbl          add label
casecomp        case compare             case complete
Max of 10000 iterations
Multiple Possible Splits - Failures
Identifier                    Split found
serialversionuid              serial version did
selectionzordered             selection ordered
removefrfigurerequestremove   remove figure request remove
jhotdraw                      hot draw
getvadjustable                get bad just able
fimagewidth                   him age width
fimageheight                  him age height
writeref                      write red
DTW does not account for context, syntax, or semantics.
Max of 10000 iterations
Discussion - Challenges
• How can we expand fwrite or pdraw?
• How can we avoid expanding FileLen into File Lender rather than File Length?
• How can we recognize that ImagEdit has a correct split at distance 1 and not 0?
• How can we expand/split pqrstuvwxyz?
Threats to Validity
• External validity:
  - We analyzed only two systems;
  - However: different domains, different programming languages.
• Construct validity: errors may be present in the oracle!
  - We detected 1% error in the first oracle release;
  - We did our best to guess the programmers’ intentions, but we cannot exclude errors.
• Reliability validity: replication package available.
• Internal validity: subjectivity and bias in building the oracle:
  - The same researcher built both oracles;
  - Oracles were validated by two other researchers;
  - The oracles are large enough that an error rate of a few percent would not change the conclusions.
Conclusion
• We presented a search-based approach to automatically segment source code identifiers.
• The novel approach is inspired by developer behavior when composing identifiers.
• The approach uses a dictionary, a distance computed via DTW, and a set of word transformations.
• Results on JHotDraw and Lynx show the superiority of the approach over a simple Camel Case splitter.
Future Work
We plan to:
• Expand the evaluation to other systems.
• Introduce enhanced heuristics for term selection and word transformations.
• Contextualize our search by coupling our algorithm with the approach of Enslen et al. [ELK, 2009] (restrict the search to the words used in the same method, class, or package).
Finally… Questions
Thank you for your attention
References
[ELK, 2009] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining source code to automatically split identifiers for software analysis,” Mining Software Repositories, International Workshop on, vol. 0, pp. 71 - 80, 2009.
[H. Ney, 1984] H. Ney, “The use of a one-stage dynamic programming algorithm for connected word recognition,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 32, no. 2, pp. 263 - 271, Apr 1984.
D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “Effective identifier names for comprehension and memory,” Innovations in Systems and Software Engineering, vol. 3, no. 4, pp. 303 - 318, 2007.
D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “What’s in a name? A study of identifiers,” in Proc. of the International Conference on Program Comprehension (ICPC), 2006, pp. 3 - 12.
Overall Splitting (Hill Climbing) Procedure
[Flowchart. Box labels: Identifier; Dictionary; DTW Match; Ranked Word List; Best Matching; Zero Dist? (Yes: Success!); Temporary Dictionary; Improved? (Yes: Save word and create new dictionary; No: Discard word and create new dictionary).]