Developing Natural Language-based Software Analyses and Tools to Expedite Software Maintenance. Lori Pollock. Collaborators: K. Vijay-Shanker, Emily Hill, Giriprasad Sridhara. Past collaborators: David Shepherd, Zachary P. Fry, Kishen Maloor.


Page 1:

Developing Natural Language-based Software Analyses and Tools to Expedite Software Maintenance

Lori Pollock

Collaborators: K. Vijay-Shanker, Emily Hill, Giriprasad Sridhara

Past collaborators: David Shepherd, Zachary P. Fry, Kishen Maloor

Page 2:

Problem

Modern software is large and complex

(Figure: object-oriented class hierarchy)

Software maintenance:

- search/locate

- navigate

- understand

- modify

-> Automated Support

Page 3:

Successes in Software Maintenance Tools

(Figure: object-oriented class hierarchy)

Good with local tasks

Good with traditional structure

Page 4:

Challenges in Software Maintenance Tools

(Figure: object-oriented class hierarchy)

Scattered tasks are difficult

Programmers use more than traditional program structure

Page 5:

Observations

(Figure: object-oriented system)

public interface Storable {...

public void Circle.save()

//Store the fields in a file....

- activate tool

- save drawing

- update drawing

- undo action

Key Insight: Programmers leave natural language clues that can benefit software maintenance tools

Page 6:

Studies on choosing identifiers

Impact of human cognition on names [Liblit et al. PPIG 06]

- Metaphors, morphology, scope, part of speech hints

- Hints for understanding code

Analysis of function identifiers [Caprile and Tonella WCRE 99]

- Lexical, syntactic, semantic analysis

- Use for software tools: metrics, traceability, program understanding

Carla, the compiler writer: I don’t care about names.

Pete, the programmer: So, I could use x, y, z. But, no one will understand my code.

Page 7:

Our Research Focus and Impact

Software Maintenance Tools: Search, Exploration, Understanding, …

NLPA

Natural Language Analysis: Word relations (synonyms, antonyms, …), Part of speech tagging, Abbreviations, …

Page 8:

Our Research Contributions…

- Motivated use of NL clues during maintenance

- Clue Extraction + NL-based Program Representation

- FindConcept: concern location tool

- iTimna: aspect miner

- Dora the Program Explorer

- Abbreviation Expander

- Word Relation Tool Comparison Study

[MACS 05, LATE 05] [AOSD 06, IET 08] [AOSD 07, PASTE 07] [ASE 05] [ASE 07] [MSR 08] [ICPC 08]

Page 9:

Automatic Natural Language Clue Extraction from Source Code

Key Challenges:

- Decode name usage

- Develop automatic NL clue extraction process (focused on Java)

- Create NL-based program representation

Molly, the Maintainer: What was Pete thinking when he wrote this code?

Page 10:

Natural Language: Which Clues to Use?

Software Maintenance

- Typically focused on actions

- Objects are well-modularized

Focus on actions

- Correspond to verbs

- Verbs need Direct Object (DO)

Extract verb-DO pairs

Page 11:

Extracting Verb-DO Pairs

Two types of extraction:

- Extraction from comments

- Extraction from method signatures

class Player {
    /**
     * Play a specified file with specified time interval
     */
    public static boolean play(final File file, final float fPosition, final long length) {
        fCurrent = file;
        try {
            playerImpl = null;
            //make sure to stop non-fading players
            stop(false);
            //Choose the player
            Class cPlayer = file.getTrack().getType().getPlayerImpl();
            …
}

Page 12:

Extracting Clues from Signatures

public UserList getUserListFromFile( String path ) throws IOException {
    try {
        File tmpFile = new File( path );
        return parseFile(tmpFile);
    } catch( java.io.IOException e ) {
        throw new IOException( "UserList format issue" + path + " file " + e );
    }
}

1. Part-of-speech tag method name

2. Chunk method name

3. Identify Verb and Direct-Object (DO)

POS Tag: get<verb> User<adj> List<noun> From<prep> File<noun>

Chunk: get<verb phrase> User List<noun phrase> From File<prep phrase>
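The three steps on this slide can be approximated with a small sketch. It does not use a real part-of-speech tagger or chunker: as a loudly labeled assumption, the first word of the method name is treated as the verb and the words up to the first preposition are treated as the direct object. The class name `VerbDoSketch` and the preposition list are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class VerbDoSketch {
    // Hypothetical, fixed preposition list standing in for a real POS tagger.
    private static final Set<String> PREPOSITIONS =
            new HashSet<>(Arrays.asList("from", "to", "in", "on", "for", "with", "by", "at"));

    /** Splits a camel-case identifier into lowercase words. */
    static List<String> splitCamelCase(String name) {
        List<String> words = new ArrayList<>();
        // Split where a lowercase letter or digit is followed by an uppercase letter.
        for (String part : name.split("(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                words.add(part.toLowerCase());
            }
        }
        return words;
    }

    /** Returns {verb, directObject}; assumes the first word is the verb. */
    static String[] extractVerbDo(String methodName) {
        List<String> words = splitCamelCase(methodName);
        if (words.isEmpty()) {
            return new String[] {"", ""};
        }
        String verb = words.get(0);
        List<String> directObject = new ArrayList<>();
        for (String word : words.subList(1, words.size())) {
            if (PREPOSITIONS.contains(word)) {
                break; // the prepositional phrase is not part of the direct object
            }
            directObject.add(word);
        }
        return new String[] {verb, String.join(" ", directObject)};
    }

    public static void main(String[] args) {
        String[] pair = extractVerbDo("getUserListFromFile");
        System.out.println(pair[0] + " / " + pair[1]); // prints "get / user list"
    }
}
```

The heuristic breaks on names that do not start with a verb, which is one reason a tagging and chunking step is used on the slide.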

Page 13:

Representing Verb-DO Pairs

Action-Oriented Identifier Graph (AOIG)

(Figure: verb nodes verb1, verb2, verb3; DO nodes DO1, DO2, DO3; verb-DO pair nodes (verb1, DO1), (verb1, DO2), (verb3, DO2), (verb2, DO3); edges labeled "use"; source code files)

Page 14:

Action-Oriented Identifier Graph (AOIG)

Example

(Figure: verb nodes play, add, remove; DO nodes file, playlist, listener; verb-DO pair nodes (play, file), (play, playlist), (remove, playlist), (add, listener); edges labeled "use"; source code files)
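One way to picture the AOIG as a data structure is sketched below. This is an illustrative assumption, not the authors' implementation: verbs and direct objects are indexed against each other, and each verb-DO pair keeps a list of "use" locations in the source code. The class name `AoigSketch` and the location strings in `example()` are hypothetical; only the four verb-DO pairs come from the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AoigSketch {
    private final Map<String, Set<String>> dosByVerb = new HashMap<>();
    private final Map<String, Set<String>> verbsByDo = new HashMap<>();
    private final Map<String, List<String>> usesByPair = new HashMap<>();

    /** Records that the pair (verb, directObject) is used at the given code location. */
    void addUse(String verb, String directObject, String location) {
        dosByVerb.computeIfAbsent(verb, k -> new TreeSet<>()).add(directObject);
        verbsByDo.computeIfAbsent(directObject, k -> new TreeSet<>()).add(verb);
        usesByPair.computeIfAbsent(verb + "," + directObject, k -> new ArrayList<>()).add(location);
    }

    /** Direct objects that occur with this verb, in sorted order. */
    Set<String> directObjectsOf(String verb) {
        return dosByVerb.getOrDefault(verb, Collections.emptySet());
    }

    /** Verbs that occur with this direct object, in sorted order. */
    Set<String> verbsOf(String directObject) {
        return verbsByDo.getOrDefault(directObject, Collections.emptySet());
    }

    /** Code locations where the pair is used. */
    List<String> usesOf(String verb, String directObject) {
        return usesByPair.getOrDefault(verb + "," + directObject, Collections.emptyList());
    }

    /** Builds the four pairs from the slide; the locations are made up for illustration. */
    static AoigSketch example() {
        AoigSketch graph = new AoigSketch();
        graph.addUse("play", "file", "hypothetical/Player.java");
        graph.addUse("play", "playlist", "hypothetical/Player.java");
        graph.addUse("remove", "playlist", "hypothetical/PlaylistView.java");
        graph.addUse("add", "listener", "hypothetical/EventBus.java");
        return graph;
    }

    public static void main(String[] args) {
        AoigSketch graph = example();
        System.out.println(graph.directObjectsOf("play")); // prints "[file, playlist]"
        System.out.println(graph.verbsOf("playlist"));     // prints "[play, remove]"
    }
}
```

Indexing in both directions lets a tool move from a verb to its direct objects and back, which is the kind of traversal a query-refinement step would need.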

Page 15:

Evaluation of Clue Extraction

Compared automatic vs ideal (human) extraction

- 300 methods from 6 medium open source programs

- Annotated by 3 Java developers

Promising Results

- Precision: 57%

- Recall: 64%

Context of Results

- Did not analyze trivial methods

- On average, at least verb OR direct object obtained

Page 16:

Using AOIG in Concern Location

Find, collect, and understand all source code related to a particular concept

Concerns are often crosscutting

Page 17:

State of the Art for Concern Location

- Mining Dynamic Information [Wilde ICSM 00]

- Program Structure Navigation [Robillard FSE 05, FEAT, Schaefer ICSM 05]

- Search-Based Approaches

  - RegExp [grep, Aspect Mining Tool 00]

  - LSA-Based [Marcus 04]

  - Word-Frequency Based [GES 06]

Annotations: Reduced to similar problem; Slow; Fast; Fragile; Sensitive; No Semantics

Page 18:

Limitations of Search Techniques

1. Return large result sets

2. Return irrelevant results

3. Return hard-to-interpret result sets

Page 19:

Find-Concept Search Tool

(Figure labels: concept; Find-Concept query; Recommendations; Source Code; NL-based Code Rep; Natural Language Information; Result Graph with Method a, Method b, Method c, Method d, Method e)

1. More effective search

2. Improved search terms

3. Understandable results

Page 20:

Underlying Program Analysis

Word Recommendation Algorithm

- Stemmed/Rooted: complete, completing

- Synonyms: finish, complete

- Co-location: completeWord()

Uses traversals of Action-oriented identifier graph (AOIG)

Page 21:

Experimental Evaluation

Research Questions

- Which search tool is most effective at forming and executing a query for concern location?

- Which search tool requires the least human effort to form an effective query?

Tools: Find Concept, GES, ELex

Methodology: 18 developers complete nine concern location tasks on medium-sized (>20KLOC) programs

Measures: Precision (quality), Recall (completeness), F-Measure (combination of both P & R)
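The three measures named on this slide can be computed as in the sketch below. The slide only says that F-Measure combines P and R; the balanced harmonic mean used here is an assumption, and the class name `RetrievalMeasures` and the sample sets are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class RetrievalMeasures {
    /** Fraction of returned items that are relevant (quality). */
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / returned.size();
    }

    /** Fraction of relevant items that were returned (completeness). */
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / relevant.size();
    }

    /** Balanced F-measure: harmonic mean of precision and recall (an assumption here). */
    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    private static int overlap(Set<String> returned, Set<String> relevant) {
        Set<String> common = new HashSet<>(returned);
        common.retainAll(relevant);
        return common.size();
    }

    public static void main(String[] args) {
        // Hypothetical search result and gold set of method names.
        Set<String> returned = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> relevant = new HashSet<>(Arrays.asList("a", "b", "e"));
        double p = precision(returned, relevant); // 2 of 4 returned are relevant
        double r = recall(returned, relevant);    // 2 of 3 relevant were returned
        System.out.println(p + " " + r + " " + fMeasure(p, r));
    }
}
```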

Page 22:

Overall Results

Effectiveness

- FC > ELex with statistical significance

- FC >= GES on 7/9 tasks

- FC is more consistent than GES

Effort

- FC = ELex = GES

Across all tasks

FC is more consistent and more effective in experimental study without requiring more effort

Page 23:

Our Research Focus and Impact

Software Maintenance Tools: Search, Exploration, Understanding, …

NLPA

Natural Language Analysis: Word relations (synonyms, antonyms, …), Part of speech tagging, Abbreviations, …

Page 24:

Dora the Program Explorer*

* Dora comes from exploradora, the Spanish word for a female explorer.

(Figure: Query and Program Structure feed into Dora, which produces the Relevant Neighborhood)

Natural Language Query

• Maintenance request

• Expert knowledge

• Query expansion

Program Structure

• Representation

  • Current: call graph

• Seed starting point

Relevant Neighborhood

• Subgraph relevant to query

Page 25:

State of the Art in Exploration

Structural (dependence, inheritance)

- Slicing

- Suade [Robillard 2005]

Lexical (identifier names, comments)

- Regular expressions: grep, Eclipse search

- Information Retrieval: FindConcept [Shepherd 2007], Google Eclipse Search [Poshyvanyk 2006]

Page 26:

Dora: Using Program Structure + Ids

Key Insight: Automated tools can use program structure and identifier names to save the developer time and effort

Example Scenario:

- Program: JBidWatcher, an eBay auction sniping program

- Bug: User-triggered add auction event has no effect

- Task: Locate code related to ‘add auction’ trigger

- Seed: DoAction() method, from prior knowledge

Page 27:

Using only structural information

Looking for: ‘add auction’ trigger

(Figure: call graph from the seed DoAction() to its callees; relevant methods DoAdd() and DoPasteFromClipboard(); irrelevant methods shown as DoNada())

DoAction() has 38 callees, only 2/38 are relevant

Locates locally relevant items, but many irrelevant

And what if you wanted to explore more than one edge away?

Page 28:

Using only lexical information

50/1812 methods contain matches to ‘add*auction’ regular expression query

Only 2/50 are relevant

Locates globally relevant items, but many irrelevant

Page 29:

Combining Structural & Lexical Information

Looking for: ‘add auction’ trigger

Structural: guides exploration from seed

Lexical: prunes irrelevant edges

(Figure: call graph from the seed DoAction(); Relevant Neighborhood with DoAction(), DoAdd(), DoPasteFromClipboard(); irrelevant callees shown as DoNada())

Page 30:

The Dora Approach

Prune irrelevant structural edges from seed

- Determine method relevance to query

  - Calculate lexical-based relevance score

- Prune low-scored methods from neighborhood

- Recursively explore
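The score, prune, and recurse loop on this slide can be sketched as follows. The slides do not say how the lexical relevance score is computed, so the scoring here is a stand-in assumption: the fraction of query terms that appear in a method's name. The class name `NeighborhoodExplorer`, the threshold value, and the toy call graph in `demo()` are all hypothetical and are not taken from JBidWatcher.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NeighborhoodExplorer {
    private final Map<String, List<String>> callGraph; // caller -> callees
    private final List<String> queryTerms;
    private final double threshold;

    NeighborhoodExplorer(Map<String, List<String>> callGraph, List<String> queryTerms, double threshold) {
        this.callGraph = callGraph;
        this.queryTerms = queryTerms;
        this.threshold = threshold;
    }

    /** Stand-in lexical score: fraction of query terms found in the method name. */
    double score(String method) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        String lower = method.toLowerCase();
        int hits = 0;
        for (String term : queryTerms) {
            if (lower.contains(term.toLowerCase())) {
                hits++;
            }
        }
        return (double) hits / queryTerms.size();
    }

    /** Explores from the seed, following only call edges to methods that score high enough. */
    Set<String> explore(String seed) {
        Set<String> neighborhood = new LinkedHashSet<>();
        visit(seed, neighborhood);
        return neighborhood;
    }

    private void visit(String method, Set<String> neighborhood) {
        if (!neighborhood.add(method)) {
            return; // already visited; also guards against cycles in the call graph
        }
        for (String callee : callGraph.getOrDefault(method, Collections.emptyList())) {
            if (score(callee) >= threshold) {
                visit(callee, neighborhood); // low-scored callees are pruned
            }
        }
    }

    /** Toy call graph with made-up method names. */
    static Set<String> demo() {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("handleEvent", Arrays.asList("addAuction", "repaintTable", "logDebug"));
        graph.put("addAuction", Arrays.asList("parseAuctionId", "playSound"));
        return new NeighborhoodExplorer(graph, Arrays.asList("add", "auction"), 0.5).explore("handleEvent");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[handleEvent, addAuction, parseAuctionId]"
    }
}
```

A name-only score would miss a relevant method whose name shares no words with the query, so a practical scorer would likely need to look beyond the method name.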

Page 31:

Evaluation of the Dora Approach

- Evaluated on 9 concerns

- Lexical + structural > structural

- However, success highly dependent on lexical scoring performance

(Figure label: Structure only)

Page 32:

Our Research Focus and Impact

Software Maintenance Tools: Search, Exploration, Understanding, …

NLPA

Natural Language Analysis: Word relations (synonyms, antonyms, …), Part of speech tagging, Abbreviations, …

Page 33:

Automatic Abbreviation Expansion

• Don’t want to miss relevant code with abbreviations

• Given a code segment, identify character sequences that are short forms and determine long form

• Approach: Mine expansions from code [MSR 08]

1. Split Identifiers:

   - Punctuation

   - Camel case

   - No boundary

2. Identify non-dictionary words

3. Determine long form

e.g., strlen (labeled "no boundary" and "non-dictionary word")

Page 34:

Simple Dictionary Approach

Manually create a lookup table of common abbreviations in code

- Vocabulary evolves over time, must maintain table

- Same abbreviation can have different expansions depending on domain AND context:

cfg?

- Control Flow Graph

- Context-Free Grammar

- configuration

- configure

Page 35:

Types of Non-Dictionary Words

Single-Word

- Prefix (attr, obj, param, i)

- Dropped Letter (src, evt, msg)

Multi-Word

- Acronyms (ftp, xml, [type names])

- Combination (println, doctype)

Others

- No boundary (saveas, filesize)

- Misspelling (instanciation, zzzcatzzzdogzzz)

Page 36:

Long Form Search Patterns

Given short form arg, we search for regular expressions matching long forms in code:

Single-Word

- Prefix: argument

- Dropped letter: average

Multi-Word

- Acronym: attribute random group

- Combination: access rights
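The four pattern types on this slide can be illustrated with regular expressions built from the short form. The slides do not give the exact expressions, so the ones below are assumptions written to match the slide's four examples for `arg`; the class name `LongFormPatterns` is hypothetical.

```java
import java.util.regex.Pattern;

final class LongFormPatterns {
    /** Prefix: the short form followed by more letters, e.g. arg -> argument. */
    static Pattern prefix(String shortForm) {
        return Pattern.compile("\\b" + Pattern.quote(shortForm) + "[a-z]+\\b");
    }

    /** Dropped letter: the short form's letters in order within one word, e.g. arg -> average. */
    static Pattern droppedLetter(String shortForm) {
        StringBuilder regex = new StringBuilder("\\b");
        for (char c : shortForm.toCharArray()) {
            regex.append(Pattern.quote(String.valueOf(c))).append("[a-z]*");
        }
        return Pattern.compile(regex.append("\\b").toString());
    }

    /** Acronym: one word per letter, e.g. arg -> attribute random group. */
    static Pattern acronym(String shortForm) {
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < shortForm.length(); i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(String.valueOf(shortForm.charAt(i)))).append("[a-z]+");
        }
        return Pattern.compile(regex.append("\\b").toString());
    }

    /** Combination: letters in order, optionally spread over several words, e.g. arg -> access rights. */
    static Pattern combination(String shortForm) {
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < shortForm.length(); i++) {
            if (i > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(String.valueOf(shortForm.charAt(i)))).append("[a-z]*");
        }
        return Pattern.compile(regex.append("\\b").toString());
    }

    /** True if the whole candidate string matches the pattern. */
    static boolean matches(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(prefix("arg"), "argument"));               // prints "true"
        System.out.println(matches(droppedLetter("arg"), "average"));         // prints "true"
        System.out.println(matches(acronym("arg"), "attribute random group")); // prints "true"
        System.out.println(matches(combination("arg"), "access rights"));     // prints "true"
    }
}
```

In this sketch the looser patterns also accept what the stricter ones accept (the dropped-letter pattern matches `argument` too), so the order in which patterns are tried matters.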

Page 37:

Search Pattern Order

Search by abbreviation type:

(Figure labels: Acronym, Combination, Prefix, Dropped Letter; Multi-Word, Single Word; Conservative, Greedy)

How do we identify potential long forms for each type?

Page 38:

Context-based Approach through Scope

Inspired by static scoping, start from method containing abbreviation and search increasingly broader “scopes” until clear winner:

1. JavaDoc

2. Type Names of declared variables

3. Method Name

4. Statements

5. Referenced identifiers and string literals

6. Method comments

7. Class comments
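The scope-ordered search on this slide can be sketched as a loop over text scopes, from narrowest to broadest. Two assumptions are made for illustration: only a prefix-style pattern is tried, and a "clear winner" is taken to mean exactly one distinct candidate in a scope. The class name `ScopeSearch` and the sample scope texts are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScopeSearch {
    /**
     * Searches the scopes in order and returns the long form from the first
     * scope that yields exactly one distinct candidate, or null if none does.
     */
    static String expand(String shortForm, List<String> scopesInOrder) {
        // Prefix-style pattern only; other pattern types are omitted in this sketch.
        Pattern prefix = Pattern.compile(
                "\\b" + Pattern.quote(shortForm) + "[a-z]+\\b", Pattern.CASE_INSENSITIVE);
        for (String scopeText : scopesInOrder) {
            Set<String> candidates = new TreeSet<>();
            Matcher matcher = prefix.matcher(scopeText);
            while (matcher.find()) {
                candidates.add(matcher.group().toLowerCase());
            }
            if (candidates.size() == 1) {
                return candidates.iterator().next(); // clear winner in this scope
            }
            // zero or several candidates: move on to the next, broader scope
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up scope texts: JavaDoc, type names, method name, statements.
        List<String> scopes = Arrays.asList(
                "Runs the tool.",
                "String",
                "run",
                "String arg = arguments[0];");
        System.out.println(expand("arg", scopes)); // prints "arguments"
    }
}
```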

Page 39:

What if no long form found?

Fall back to Most Frequent Expansion (MFE)

MFE leverages successful local expansions and applies throughout the program

1. Program: provides domain knowledge

2. Java: more general programming knowledge
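A minimal sketch of the MFE fallback is given below. It assumes that successful local expansions are simply counted per short form, and that the program-level table is consulted before the Java-level one, as the numbered list suggests. The class name `MostFrequentExpansion` and the sample counts are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class MostFrequentExpansion {
    // short form -> (long form -> number of successful local expansions)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one successful local expansion of shortForm to longForm. */
    void record(String shortForm, String longForm) {
        counts.computeIfAbsent(shortForm, k -> new HashMap<>()).merge(longForm, 1, Integer::sum);
    }

    /** Returns the most frequent long form, or null if the short form was never expanded. */
    String lookup(String shortForm) {
        String best = null;
        int bestCount = 0;
        Map<String, Integer> forms =
                counts.getOrDefault(shortForm, Collections.<String, Integer>emptyMap());
        for (Map.Entry<String, Integer> entry : forms.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Tries the program-level table first, then the more general Java-level table. */
    static String expand(String shortForm, MostFrequentExpansion program, MostFrequentExpansion java) {
        String fromProgram = program.lookup(shortForm);
        return fromProgram != null ? fromProgram : java.lookup(shortForm);
    }

    public static void main(String[] args) {
        MostFrequentExpansion program = new MostFrequentExpansion();
        program.record("cfg", "control flow graph");
        program.record("cfg", "control flow graph");
        program.record("cfg", "configuration");
        MostFrequentExpansion java = new MostFrequentExpansion();
        java.record("cfg", "configuration");
        java.record("str", "string");
        System.out.println(expand("cfg", program, java)); // prints "control flow graph"
        System.out.println(expand("str", program, java)); // prints "string"
    }
}
```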

Page 40:

Experimental Evaluation

(Chart: Number of Correct Expansions, axis from 0 to 250, for No Exp, LFB, Java MFE, Prog MFE, Our Scope)

Accuracy: 22% 40% 45% 54% 59% 63%

- Scope 57% more accurate than state of the art LFB

- Scope 30% more accurate than Java MFE

- Program MFE acceptable approximation when speed more important than accuracy

Page 41:

In Conclusion…

Evaluation studies indicate

- Natural language analysis has far more potential to improve software maintenance tools than we initially believed

Existing technology falls short

- Synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG

Keys to further success

- Improve recall

- Extract additional NL clues