Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools Zak Fry


Outline

• Problem and Motivation
• Automatically Identifying Abbreviation Expansions
• A Scoped Approach
• Analysis and Refinement: iScope
• Evaluations
• Conclusions

Maintenance Tasks

• Maintenance accounts for 60-90% of the software lifecycle
• Problem: identifying where the relevant code is, i.e. where changes need to be made
• Code to perform a certain task can be very scattered
• This causes difficulty for current maintenance search tools

Challenges - Coding Practices

• Identifier names are important for code documentation and understanding
• Problem: programmers' use of abbreviations in code
  – Frequency of occurrence: commonly used words such as character, integer, string
  – Complex inheritance leads to long class names, e.g. SecureMessageServiceClientMessageImpl
• Abbreviations negate the usefulness of identifier names and complicate program understanding

Abbreviations and Maintenance Tools

• Problem: Search based maintenance tools rely on natural language
  – Abbreviations change the natural language

Search Term: “distributed hash”

    dht = (DHTPlugin) dht_pi.getPlugin();
    Thread t = new AEThread("DHTTrackerPlugin:init") {
        public void runSupport() {
            try {
                if (dht.isEnabled()) {
                    log.log("DDB Available");
                }
            } catch (Throwable e) {
                log.log("DDB Failed", e);
            }
            ...
        }
    }

Automatically Identifying Abbreviation Expansions

• First, how do we identify candidates for expansion?
  – Non-dictionary words
• Abbreviation
  – Short form
• Expansion
  – Long form
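A minimal sketch of candidate identification, assuming identifiers are split on underscores and camelCase boundaries and checked against a word list. The tiny `DICTIONARY` and the helper names are hypothetical stand-ins, not part of the presented tool.

```python
import re

# Hypothetical stand-in dictionary; a real tool would load a full word list.
DICTIONARY = {"get", "handler", "event", "plugin", "tracker", "is", "enabled"}

def split_identifier(identifier):
    """Split on underscores and camelCase boundaries, lowercasing each token."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]

def expansion_candidates(identifier, dictionary=DICTIONARY):
    """Return the alphabetic tokens of an identifier that are not dictionary words."""
    return [t for t in split_identifier(identifier)
            if t.isalpha() and t not in dictionary]

print(expansion_candidates("DHTTrackerPlugin"))
print(expansion_candidates("getEvt"))
```

Every token returned here is a short form that still needs a long form; dictionary words are left alone.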

Types of Non-Dictionary Words

Abbreviation Category                Type                    Short Form              Long Form
Single Word                          Prefix                  int                     integer
Single Word                          Dropped Letter          evt                     event
Multiple Word                        Acronym                 FBI                     Federal Bureau of Investigation
Multiple Word                        Combination Multiword   recblk                  receive block
Domain Keywords and Special Cases    ---                     parsetree, serialize    ---
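The four abbreviation types in the table can be told apart with simple string tests. The following is an illustrative sketch only: the function names, the stop-word list, and the exact matching rules are assumptions, not the matching patterns used by the presented approach.

```python
STOPWORDS = {"of", "the", "and", "for"}  # assumed connecting words skipped in acronyms

def is_prefix(sf, word):
    """'int' is a proper prefix of 'integer'."""
    return word.startswith(sf) and len(word) > len(sf)

def is_dropped_letter(sf, word):
    """'evt' keeps the first letter of 'event' and its letters appear in order."""
    if not sf or not word or sf[0] != word[0] or len(sf) >= len(word):
        return False
    remaining = iter(word)
    return all(ch in remaining for ch in sf)

def is_acronym(sf, words):
    """'fbi' is spelled by the initials of the non-stop-words."""
    initials = "".join(w[0] for w in words if w not in STOPWORDS)
    return len(words) > 1 and initials == sf

def is_combination(sf, words):
    """'recblk' splits into parts that each abbreviate one consecutive word."""
    def fits(part, word):
        return part == word or is_prefix(part, word) or is_dropped_letter(part, word)

    def consume(rest, i):
        if i == len(words):
            return rest == ""
        return any(fits(rest[:k], words[i]) and consume(rest[k:], i + 1)
                   for k in range(1, len(rest) + 1))

    return len(words) > 1 and consume(sf, 0)

def classify(short_form, long_form):
    """Return the abbreviation type linking a short form to a long form, or None."""
    sf, words = short_form.lower(), long_form.lower().split()
    if is_acronym(sf, words):
        return "acronym"
    if len(words) == 1:
        if is_prefix(sf, words[0]):
            return "prefix"
        if is_dropped_letter(sf, words[0]):
            return "dropped letter"
        return None
    return "combination multiword" if is_combination(sf, words) else None

for sf, lf in [("int", "integer"), ("evt", "event"),
               ("FBI", "Federal Bureau of Investigation"), ("recblk", "receive block")]:
    print(sf, "->", classify(sf, lf))
```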

State of the Art

• Lawrie, Feild, and Binkley
  – Abbreviation expansion
  – Problems: lack of precision; no support for choosing between multiple matches

Scoped Approach

• How to choose between multiple possible long forms:
  – By manual inspection, we found that correct long forms are more likely to be found in certain locations
  – Also, correctly identifying the long forms is easier for certain types of abbreviations than for others


Order of Types

Abbreviation Type

1: Acronym

2: Prefix

3: Dropped Letter

4: Combo Multiword

5: Most Common

Order of Program Context

Context

1: Javadoc

2: Type

3: Method Name

4: Statement

5: Method

6: Method Comments

7: Class Comments

General Algorithm

[Figure: for each abbreviation type in order (Acronym, then Prefix, ...), the program contexts are searched in order (Javadoc, Type, Method Name, ...)]
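The search order described above can be sketched as a nested loop: abbreviation types on the outside, program contexts on the inside, stopping at the first context that yields a match. This is a rough illustration under assumptions: `toy_matches` is a hypothetical matcher that handles only prefix and dropped-letter abbreviations, and contexts are passed in as plain text.

```python
import re

TYPE_ORDER = ["acronym", "prefix", "dropped letter", "combination multiword"]
CONTEXT_ORDER = ["javadoc", "type", "method name", "statement",
                 "method", "method comments", "class comments"]

def toy_matches(kind, short_form, text):
    """Hypothetical matcher: returns candidate long forms found in text."""
    words = re.findall(r"[a-z]+", text.lower())
    if kind == "prefix":
        return [w for w in words
                if w.startswith(short_form) and len(w) > len(short_form)]
    if kind == "dropped letter":
        def keeps_order(word):
            remaining = iter(word)
            return word[0] == short_form[0] and all(ch in remaining for ch in short_form)
        return [w for w in words if len(w) > len(short_form) and keeps_order(w)]
    return []  # acronym and combination matching are omitted in this sketch

def scoped_search(short_form, contexts, find_matches):
    """Try every context for one abbreviation type before moving to the next type."""
    for kind in TYPE_ORDER:
        for context in CONTEXT_ORDER:
            matches = find_matches(kind, short_form, contexts.get(context, ""))
            if matches:
                return kind, context, matches
    return None

print(scoped_search("evt", {"javadoc": "Handles every incoming event."}, toy_matches))
```

When several long forms come back from the same context, a separate tie-breaking step is needed to pick one.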

Multiple matches

• We assume there is one best candidate, though multiple candidates might be present at the same level of scope
• If there are multiple matches:
  1. Examine frequencies
  2. Stem long forms and reexamine frequencies
  3. Broaden scope and reexamine frequencies
  4. Fall back to the most frequent expansion
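The four tie-breaking steps could look roughly like the sketch below. Several details are assumptions made for illustration: `crude_stem` is a simplistic stand-in for a real stemmer, a "tie" means the top two counts are equal, and the broadened scope only counts candidates that were already matched locally.

```python
from collections import Counter

def crude_stem(word):
    """Rough stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

def unique_top(counts):
    """Return the single most frequent key, or None if the top count is tied."""
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return None

def choose_long_form(local, broader, mfe=None):
    """local/broader: candidate long forms (with repeats) at the current and wider scope."""
    # 1. Examine frequencies at the current scope.
    best = unique_top(Counter(local))
    if best is not None:
        return best
    # 2. Stem long forms and reexamine frequencies.
    top_stem = unique_top(Counter(crude_stem(w) for w in local))
    if top_stem is not None:
        forms = Counter(w for w in local if crude_stem(w) == top_stem)
        return forms.most_common(1)[0][0]
    # 3. Broaden scope and reexamine frequencies of the local candidates.
    candidates = set(local)
    best = unique_top(Counter(w for w in broader if w in candidates))
    if best is not None:
        return best
    # 4. Fall back to the most frequent expansion, if one is known.
    return mfe

print(choose_long_form(["event", "event", "evict"], []))
```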

Most Frequent Expansion (MFE)

• If still no ideal candidate is found:
  – We mined long forms from a 1.5 million LOC Java 5 code base
  – Return the most frequent long form as a last resort
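One plausible way to build such a last-resort table is to count every mined (short form, long form) pair and keep the most frequent long form per short form. The function name and the sample pairs below are hypothetical; they are not the mined data.

```python
from collections import Counter, defaultdict

def build_mfe_table(mined_pairs):
    """Map each short form to the long form it was most often expanded to."""
    counts = defaultdict(Counter)
    for short_form, long_form in mined_pairs:
        counts[short_form][long_form] += 1
    return {sf: c.most_common(1)[0][0] for sf, c in counts.items()}

# Hypothetical mined pairs, for illustration only.
pairs = [("int", "integer"), ("int", "integer"), ("int", "interval"), ("evt", "event")]
mfe = build_mfe_table(pairs)
print(mfe.get("int"))   # most frequent long form seen for "int"
print(mfe.get("blk"))   # None: never mined, so no last-resort answer
```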

Evaluation of Scoped Approach

• 250 abbreviations from 5 subject programs
• Gold standard developed by a human developer inspecting the code manually
• Implemented LFB (the Lawrie, Feild, and Binkley approach) according to its description
  – Except combination words, due to a missing database

[Chart: accuracy]


Analysis and Refinement - iScope

Analyzed results and found 3 major sources of problems

Developed iScope by addressing these 3 major problem areas

Order of Scoping

• Problem:
  – Scoped approach ordering: examine every context for an abbreviation type, then go to the next type
  – Investigating broader contexts for one type before even the narrowest context for another type is likely to yield incorrect matches
• Insight: Context is more sensitive than type
• Solution: Check each type at each context level, then go to the next context level (switch the order)
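The difference between the two orderings is just which loop is on the outside. The sketch below enumerates the (type, context) pairs in the order each approach would visit them; the list contents follow the orderings given earlier in the deck, while the function names are made up for illustration.

```python
from itertools import product

TYPES = ["acronym", "prefix", "dropped letter", "combination multiword"]
CONTEXTS = ["javadoc", "type", "method name", "statement",
            "method", "method comments", "class comments"]

def scope_order():
    """Scope: exhaust every context for one type before trying the next type."""
    return [(t, c) for t, c in product(TYPES, CONTEXTS)]

def iscope_order():
    """iScope: try every type at one context level before broadening the context."""
    return [(t, c) for c, t in product(CONTEXTS, TYPES)]

# Under Scope, a prefix match in the Javadoc is only considered after acronyms
# have been tried in all seven contexts, including the broad class comments.
print(scope_order().index(("prefix", "javadoc")))
print(iscope_order().index(("prefix", "javadoc")))
```

Both functions visit the same 28 pairs; only the order changes, which is what lets the narrowest contexts win under iScope.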

Single Letter Abbreviations

• Problem:
  – Developers use single letter abbreviations differently than multiple letter abbreviations
  – A large subset are actually semantically meaningless
  – A single letter is very easily matched, especially because prefix matching is greedy
  – Example: Reader r = new BufferedReader()
• Insight: Based on manual inspection, we found that meaningful single letter short forms were identifiers whose long forms were also their type name
• Solution: Limit the contextual scope to the type only
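A minimal sketch of the type-only rule for single letters, assuming the declared type name is available as a string. The function name and the decision to return `None` for unmatched letters are illustrative choices, not details from the presentation.

```python
def expand_single_letter(short_form, declared_type):
    """Expand a one-letter identifier only if its declared type starts with that letter."""
    if len(short_form) != 1:
        raise ValueError("expected a single-letter short form")
    if declared_type.lower().startswith(short_form.lower()):
        return declared_type
    return None  # treated as semantically meaningless; no wider context is searched

print(expand_single_letter("r", "Reader"))     # the declared type from the example above
print(expand_single_letter("e", "Throwable"))  # no match against the type name
```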

Hyper-Common Abbreviation

• Problem: Some abbreviations are used so often in code that the long form rarely co-occurs, leading to incorrect expansions based on coincidence
• Solution: Mine a small set of extremely common abbreviations and use it as a preprocessing step
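As a preprocessing step, this amounts to a table lookup that runs before any contextual search. In the sketch below the table entries are hypothetical placeholders, not the mined list, and `contextual_search` stands for whatever scoped search would otherwise be used.

```python
# Hypothetical placeholder entries; the actual mined list is not reproduced here.
HYPER_COMMON = {"int": "integer", "str": "string", "char": "character"}

def expand(short_form, contextual_search, table=HYPER_COMMON):
    """Consult the hyper-common table first; otherwise defer to the contextual search."""
    key = short_form.lower()
    if key in table:
        return table[key]
    return contextual_search(short_form)

# A stub search that finds nothing, to show which path was taken.
print(expand("int", lambda sf: None))
print(expand("evt", lambda sf: "event"))
```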


Mined list of hyper-common abbreviations

Evaluations

• Is our method accurate enough to be useful?
  – Reevaluation of previous experiment
• Does abbreviation expansion help maintenance tasks?
  – Simple Search
  – Concern Location Task

1. Reevaluation of Previous Test

• Based on our previous experimental methodology and metrics, how much improvement was made from Scope to iScope?
• Modified the gold set based on the new assumptions about single letter abbreviations

1. Reevaluation of Previous Test - Results

• Compare LFB with Scope and iScope using non-combination word (NCW) accuracy values
• Compare JavaMFE, ProgMFE, Scope, and iScope using the total accuracy values

2. Simple Search Evaluation

• When abbreviations are expanded in software, how many more search results are returned than without expansion?
• Focus: Recall
  – Not missing important results; we want as many potentially relevant results as possible
• Metric: Percent increase (P.I.) in results
  – P.I. = (raw returned results with expansion / raw returned results without expansion) × 100% − 100%

2. Simple Search Evaluation (cont)

• Subjects: 215 concerns (Eaddy et al.), annotated by 3 people each for a total of 645 queries
  – Developed independently of the idea of abbreviation expansion, so many queries might not be affected by abbreviation expansion at all
• “Match”: if any word in the query matches any word in the method, the method is considered a match and returned as a result
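The match rule is simple enough to sketch directly. In the example, the method names, their word lists, and the expansion of `dht` to "distributed hash table" are assumptions chosen to mirror the earlier "distributed hash" search example.

```python
def simple_search(query, methods):
    """Return names of methods sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [name for name, words in methods.items()
            if query_words & {w.lower() for w in words}]

def expand_words(words, expansions):
    """Keep each word and append the words of its long form, if one is known."""
    expanded = []
    for word in words:
        expanded.append(word)
        expanded.extend(expansions.get(word.lower(), "").split())
    return expanded

# Hypothetical methods, represented by the words extracted from their code.
methods = {"initDht": ["dht", "plugin", "is", "enabled"],
           "logFailure": ["log", "failed"]}
expansions = {"dht": "distributed hash table"}  # assumed long form

print(simple_search("distributed hash", methods))
expanded = {name: expand_words(words, expansions) for name, words in methods.items()}
print(simple_search("distributed hash", expanded))
```

Because a single shared word counts as a match, expansion can only add results, which is why this experiment measures the increase in returned results, not precision.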

2. Simple Search Evaluation - Results

Approach        Total Returned Results    Percent Increase (%)
No Expansion    240,752                   ---
Scope           284,160                   18.03
iScope          282,489                   17.34

• Smaller increase with iScope, due to a decrease in single letter abbreviation false positives
• Ideally, this means the quality of the results is better
  – Examined in experiment 3
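The percent increases reported for this experiment follow from the raw totals using the P.I. metric (results with expansion divided by results without, as a percentage, minus 100%). A short check, with the function name chosen for illustration:

```python
def percent_increase(with_expansion, without_expansion):
    """P.I. = (results with expansion / results without expansion) * 100% - 100%."""
    return with_expansion / without_expansion * 100.0 - 100.0

NO_EXPANSION = 240_752
print(round(percent_increase(284_160, NO_EXPANSION), 2))  # Scope
print(round(percent_increase(282_489, NO_EXPANSION), 2))  # iScope
```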


3. Evaluation with Concern Location

Concern location task: identification of methods that are deemed to be relevant for the given search term

How much increase in effectiveness can be gained from expanding abbreviations in source code when performing concern location tasks?

3. Evaluation Methodology

• Tools: Latent Semantic Indexing (LSI) and log-entropy-based concern location
  – Goal: calculate similarity values based on the location and frequency of potential query matches
• Subjects: same as the previous experiment

3. Methodology (cont)

• Metric: Mean Average Precision (MAP)
  – Precision: # true positives / total # of positives
  – MAP: collect a precision value at every new true positive, going down the ranked returned results, then take the average of all collected values
  – Attempts to reward highly ranked true positives
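The metric can be sketched as follows, reading the description literally: precision is recorded at each rank where a true positive appears, and those values are averaged per query. Averaging over the true positives actually retrieved (and not over all relevant methods) is an assumption of this sketch, as is averaging across queries to obtain the mean.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: booleans for the ranked results, True for a true positive."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this new true positive
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)

# True positives at ranks 1 and 3 score higher than a single one at rank 2.
print(average_precision([True, False, True]))
print(average_precision([False, True]))
```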


3. Concern Location Tasks - Results


Conclusions

• Abbreviation expansion is shown to be helpful in maintenance tools and processes
• The iScope approach improves upon Scope, and greatly upon the state of the art

Future Work

• Further refinement of the expansion process to achieve the highest possible accuracy
• Full integration into a maintenance tool
• Extension to other programming languages

Acknowledgments

• Emily Hill and Haley Boyd
• Dr. Vijay K. Shanker and Dr. Lori Pollock


Questions?

Inherent Inaccuracy

• Problem: Additional errors in code are not generalizable into solvable problems
• Insight: There will always be inherent error when developing automatic systems for non-standard input