International Conference on Program Comprehension (ICPC) 2008 A Traceability Technique for Specifications Aharon Abadi, Mordechai Nisenson and Yahalomit Simionovici

International Conference on Program Comprehension (ICPC) 2008

A Traceability Technique for Specifications

Aharon Abadi, Mordechai Nisenson and Yahalomit Simionovici

ICPC 2008

2 A Comparison of Traceability Techniques for Specifications

Outline

Motivation

Goals

Our Solution: Outline of Traceability Link Process

IR Techniques

Experiments

Conclusions

Future work

Traceability

The ability to link between different artifacts

– Example artifacts: code, user manuals, design documentation, development wikis, etc.

In particular, link code to:

– Relevant requirements

– Sections in design documents

– Test-cases

– Other structured and free-text artifacts

Also, link from requirements, design documents, etc. to code

What’s Traceability Good For?

Program Comprehension

– Top-down

– Bottom-up

• Particularly relevant for the maintenance of legacy systems

Impact analysis

– Keeping non-code artifacts up-to-date

Requirement Tracing

– Discover what code needs to change to handle a new requirement

– Aid in determining whether a specification is completely implemented and covered by tests

Challenges

Scalability

– Large # of artifacts

Heterogeneity

– Large # of different document formats and programming languages

Noise

– Free text information (natural language): conjunctions, prepositions, abbreviations, etc.

– Some information may be outdated, or just plain wrong

Prior work:

– Recovering Traceability Links in Software Artifact Management Systems Using Information Retrieval Methods [De Lucia et al., 2007]

– Recovering Traceability Links between Code and Documentation [Antoniol et al., 2002, Deerwester et al., 1990, Marcus and Maletic, 2003]

Example

/** The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File… */
    public FileImpl(String nativePath ...) { … }

    /** … */
    private String f(..) { … }
}

Goals

Examine the effectiveness of IR techniques for traceability between code and documentation on “real world” data

Most prior work compared 2 specific algorithms, LSI and VSM

– Is LSI really better?

– How does LSI stack up with other dimensionality reduction techniques?

– How does it compare with techniques that do not reduce dimensionality?

How do different levels of abstraction affect the choice of the best methods?

– How to fit a method and parameters to a dataset?

Traceability Link Process

[Figure: block diagram of the traceability link process. Stages shown: Sectoring, Text Preprocessing, Document Pre-processing, IR-Index, Words extraction, Words expansion, Words ranking, Query Construction. Data flowing between them: documents, sections, partial code, and ranked words (word1, rank1), …, (wordm, rankm). Part of the diagram is marked "Off line processes".]

Text Preprocessing

Input: …Copyright owners grant member companies of the OMG permission to make a limited …

Output: …copyright owner grant member compani omg permiss make limit …

• Lower-casing; removal of stop-words, numbers, etc.

• Stemming
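The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer (the slides do not say which stemmer was used), so its output will not match the slide's example exactly.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ies", "ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, keep alphabetic tokens only (drops numbers), remove stop-words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    sample = "Copyright owners grant member companies of the OMG permission to make a limited"
    print(preprocess(sample))
```

The same function would be applied both to documentation sections and to the words taken from code, so that queries and indexed text share one vocabulary.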

Words Extraction

/** The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File… */
    public FileImpl(String nativePath ...) { … }

    /** … */
    private String f(..) { … }
}

Words extracted from the class above:

• Class name: FileImpl

• Public function names: FileImpl

• Public function arguments and return type: nativePath

• Comments: "The File interface provides…", "Creates a new File…"

• Super class name: FilePOA
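A rough way to collect these word sources from Java source is with regular expressions. The sketch below is an assumption about how such extraction could be done, not the authors' implementation; a production tool would use a real Java parser. The sample class is a hypothetical, completed version of the slide's elided example.

```python
import re

# Hypothetical completion of the slide's example; method bodies are made up.
SAMPLE_JAVA = '''
/** The File interface provides access to a file. */
public class FileImpl extends FilePOA {
    private String nativeFileName;

    /** Creates a new File. */
    public FileImpl(String nativePath) { nativeFileName = nativePath; }

    /** Returns the file name. */
    public String fileName() { return nativeFileName; }

    private String f() { return nativeFileName; }
}
'''

# public [modifiers] [return type] name(arguments)
PUBLIC_METHOD = re.compile(
    r"\bpublic\s+(?:(?:static|final|abstract|synchronized)\s+)*"
    r"(?:([\w<>\[\]]+)\s+)?(\w+)\s*\(([^)]*)\)"
)


def extract_words(java_source):
    """Collect class name, superclass, public method info and Javadoc text."""
    found = {"class": None, "superclass": None, "methods": [],
             "arguments": [], "return_types": [], "comments": []}
    class_match = re.search(r"\bclass\s+(\w+)(?:\s+extends\s+(\w+))?", java_source)
    if class_match:
        found["class"] = class_match.group(1)
        found["superclass"] = class_match.group(2)
    for return_type, name, args in PUBLIC_METHOD.findall(java_source):
        found["methods"].append(name)
        if return_type:  # constructors have no return type
            found["return_types"].append(return_type)
        for arg in (a.strip() for a in args.split(",")):
            if arg:
                found["arguments"].append(arg.split()[-1])  # keep the parameter name
    for body in re.findall(r"/\*\*(.*?)\*/", java_source, flags=re.DOTALL):
        text = " ".join(line.strip(" *\t") for line in body.splitlines()).strip()
        if text:
            found["comments"].append(text)
    return found


if __name__ == "__main__":
    print(extract_words(SAMPLE_JAVA))
```

Private members such as `f` are skipped on purpose, matching the slide's focus on the public interface of the class.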

Words Expansion

Input: …NativePath, fileName, delete_all_elements…

Output: …NativePath, Native, Path, fileName, File, Name, delete_all_elements, Delete, all, elements…

• Use well-known coding standards for sub-word separation
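Sub-word separation along camelCase and underscore boundaries can be sketched as follows. The function name `expand` and the exact regular expression are illustrative assumptions; the sketch keeps the original casing of each sub-word, on the assumption that case is normalized later during text preprocessing.

```python
import re

# An acronym run, a capitalized or lower-case word, or a run of digits.
SUB_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def expand(identifier):
    """Return the identifier followed by its sub-words (if it has more than one)."""
    parts = []
    for chunk in identifier.split("_"):      # snake_case boundaries
        parts.extend(SUB_WORD.findall(chunk))  # camelCase boundaries
    return [identifier] + parts if len(parts) > 1 else [identifier]


if __name__ == "__main__":
    for name in ("NativePath", "fileName", "delete_all_elements", "FilePOA"):
        print(expand(name))
```

Keeping the full identifier alongside its parts lets exact matches on names like `NativePath` still count when the documentation mentions them verbatim.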

Information Retrieval (IR) Methods

Vector Space Model (VSM) [Salton et al., 1975] implemented by Lucene

– Each document, d, is represented by a vector of ranks of the terms in the vocabulary:

$v_d = [r_d(w_1), r_d(w_2), \ldots, r_d(w_{|V|})]$

– The query is similarly represented by a vector

– The similarity between the query and document is the cosine of the angle between their respective vectors

Jensen-Shannon Similarity Model [Abadi et al., 2008]

– Each document, d, is represented by its empirical probability distribution over words: $p_d(w)$

– The query is similarly represented

– The similarity score is calculated as $1 - \mathrm{JS}(p_q, p_d)$, where JS is the Jensen-Shannon divergence
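Both similarity scores can be sketched directly from these definitions. The code below makes two assumptions that the slides do not state: raw term counts stand in for the term ranks (Lucene's actual weighting is more elaborate), and the Jensen-Shannon divergence uses base-2 logarithms so that it lies in [0, 1]. The helper `rank_sections` is hypothetical and simply sorts sections by score.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between raw term-count vectors (assumed weighting)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def js_similarity(query_terms, doc_terms):
    """1 - JS(p_q, p_d), using empirical word distributions and log base 2."""
    q, d = Counter(query_terms), Counter(doc_terms)
    nq, nd = sum(q.values()), sum(d.values())
    if nq == 0 or nd == 0:
        return 0.0
    divergence = 0.0
    for w in set(q) | set(d):
        pq, pd = q[w] / nq, d[w] / nd
        mean = 0.5 * (pq + pd)
        if pq > 0:
            divergence += 0.5 * pq * math.log2(pq / mean)
        if pd > 0:
            divergence += 0.5 * pd * math.log2(pd / mean)
    return 1.0 - divergence


def rank_sections(query_terms, sections, similarity):
    """Sort (section id, score) pairs by decreasing similarity to the query."""
    scored = [(name, similarity(query_terms, terms)) for name, terms in sections.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sections = {"s1": ["file", "native", "path"], "s2": ["radio", "device"]}
    print(rank_sections(["file", "path"], sections, cosine_similarity))
    print(rank_sections(["file", "path"], sections, js_similarity))
```

Neither function reduces dimensionality: both compare the query and the section directly in the space of vocabulary words.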

Dimensionality Reduction Methods

LSI [Deerwester et al., 1990]

– Commonly used in prior studies

– An algebraic method

– Dimensions represent orthogonal topics

PLSI [Hofmann, 1999]

– Probabilistic extension to LSI

– Based on the assumption that documents are mixtures of topic distributions

– Words and documents are conditionally independent given the topic

SDR [Globerson and Tishby, 2003]

– Based on information theory

– Topics are sufficient statistics in information theory terms

– These statistics are functions that capture maximum mutual information between words and documents
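For reference, the first two methods have standard textbook formulations (stated here from general knowledge, not taken from the slides). LSI approximates the term-by-document matrix $X$ by its rank-$k$ truncated singular value decomposition, and PLSI models the joint probability of a word $w$ and a document $d$ through latent topics $z$, which is exactly the conditional-independence assumption listed above:

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
P(w, d) = \sum_{z} P(z)\, P(w \mid z)\, P(d \mid z)
```

In both cases the number of dimensions (the rank $k$, or the number of topics $z$) is a parameter that has to be chosen per dataset.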

Datasets

Software Communication Architecture (SCA) is an open architecture framework that defines how software and hardware elements operate within a software defined radio.

Common Object Request Broker Architecture (CORBA) is OMG's open, vendor-independent architecture and infrastructure that computer applications use to work together over networks.

Documentation details:

Dataset   Size (MB)   Sections   Vocabulary size
SCA       0.41        1311       4827
CORBA     1.79        3340       7161

Queries details:

Dataset   # classes   # relevant results / query   Total # of relevant results
SCA       7           6 – 13                       65
CORBA     4           5 – 20                       58

IR Quality Measures

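Precision @ n: $P(n) = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{n}$

Recall @ n: $R(n) = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}$

Average precision: $AP = \frac{\sum_{n=1}^{N} P(n)\,\mathrm{rel}(n)}{|\mathrm{relevant}|}$

Here retrieved is the set of the top n results, rel(n) is 1 if the n-th result is relevant and 0 otherwise, and N is the number of results returned.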
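The quality measures on this slide can be sketched as below, assuming a ranked result list and a set of relevant section identifiers per query (both hypothetical data structures). Mean Average Precision (MAP), used in the result plots, is the mean of the average precision over all queries.

```python
def precision_at(ranked, relevant, n):
    """Fraction of the top n results that are relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n


def recall_at(ranked, relevant, n):
    """Fraction of all relevant items found in the top n results."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Sum of P(n) over the ranks n that hold a relevant result, divided by |relevant|."""
    hits, total = 0, 0.0
    for n, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / n  # P(n) at a rank where rel(n) = 1
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    ranked = ["s1", "s2", "s3", "s4"]
    relevant = {"s1", "s3"}
    print(precision_at(ranked, relevant, 2), recall_at(ranked, relevant, 2))
    print(average_precision(ranked, relevant))
```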

MAP versus Method

Mean Average Precision (MAP) versus Dimension

Precision versus Recall

Dimensionality of Datasets

[Figure: PLSI results, with panels for SCA and CORBA]

Precision versus Recall over Algorithms for SCA

Precision versus Recall over Algorithms for CORBA

MAP versus Method – Combined over SCA & CORBA

Conclusions

Our most significant results are:

– IR techniques are effective for tracing between code and documentation in real-world systems.

– For realistic datasets, the Vector Space Model and the Jensen-Shannon model, which do not perform dimensionality reduction, were shown to be the most effective.

– SDR was shown to be the best dimensionality reduction model; in particular, it is better than LSI.

– As the documentation links become more abstract, the performance of VSM, the JS model and SDR becomes equivalent.

Additional results:

– SDR was shown to be robust to the dataset's level of abstractness

– LSI and PLSI are sensitive to the dataset's level of abstractness

– We believe that PLSI's poor performance is due to the difficulty of modeling very short documents, which could result in severe overfitting

Future work

Development of new measures for evaluation of different IR algorithms and datasets, specifically for traceability

– Example: developing a measure of “abstractness” for a specification which will help with tuning of parameters such as dimensionality

Using dimensionality reduction techniques to create a thesaurus from the indexed data, and using it to add synonyms to the query

Traceability for other types of documents and links

Investigate alternative methods for query construction

References

A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods. ACM Trans. Softw. Eng. Methodol., 16(4):13, 2007.

G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering Traceability Links Between Code and Documentation. IEEE Trans. Softw. Eng., 28(10):970-983, 2002.

S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

A. Marcus and J.I. Maletic. Recovering Documentation to Source Code Traceability Links using Latent Semantic Indexing. In ICSE ’03: Proceedings of the 25th International Conference on Software Engineering, 125-135, 2003.

G. Salton, A. Wong, and C.S. Yang. A Vector Space Model for Automatic Indexing. Commun. ACM, 18(11):613-620, 1975.

T. Hofmann. Probabilistic Latent Semantic Indexing. In SIGIR, 50-57, 1999.

A. Globerson and N. Tishby. Sufficient Dimensionality Reduction. Journal of Machine Learning Research, 3:1307-1331, 2003.

Thank You!