BOA: A LANGUAGE AND INFRASTRUCTURE FOR ANALYZING ULTRA-LARGE-SCALE SOFTWARE REPOSITORIES
Presenter: Joshan V John
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan & Tien N. Nguyen
Iowa State University, USA
Instructor: Christoph Csallner
Joshan Valayil John | The University of Texas at Arlington | CSE 6324
Agenda
Motivation
Ultra-large-scale software repositories
Barriers to mining software repositories
Solution - Boa
Goals of Boa
Boa Architecture
Evaluation
Motivation
The big three software repositories are known to host close to 1 million projects.
They contain a wealth of software and information about software.
Systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard.
Boa, a domain-specific language and infrastructure, was developed to ease testing of hypotheses related to mining software repositories.
Ultra-large-scale Software Repositories
Why analyze software repositories?
Curiosity
Identify patterns
Forecasting
Plan for better designs
Empirical Validation
Barriers to mining software repositories
Develop programming expertise to access version control systems.
Establish infrastructure to store downloaded data from software repositories.
Develop an application to access this local data.
Improve scalability of analysis infrastructure to process ultra-large-scale data.
Barriers to mining software repositories
Experiments are often irreproducible
Low reusability of experimental infrastructure
Lack of systematic curation leads to loss of experimental data.
Building analysis infrastructure to process ultra-large-scale data efficiently can be very hard.
Solution - Boa
Boa: a domain-specific language and infrastructure designed to analyze ultra-large-scale software repositories.
Goals of Boa
Easy to use
Better abstractions
Efficient & Scalable
Enhances reproducibility
A Research Question
Consider a program that answers:
“What are the churn rates for all Java projects that use SVN?”
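To make the question concrete, here is a minimal Python sketch of the computation it asks for. It assumes that "churn rate" means the average number of files changed per revision, which the slides do not define; the records in PROJECTS and the function name churn_rates are hypothetical, hand-made for illustration, and are not taken from any real repository or from Boa.

```python
from statistics import mean

# Hypothetical records: (project id, language, version control system,
# number of files changed in each revision). Not real data.
PROJECTS = [
    ("p1", "Java", "svn", [3, 1, 2]),
    ("p2", "Java", "git", [5, 5]),
    ("p3", "C", "svn", [4]),
    ("p4", "Java", "svn", [10, 2]),
]


def churn_rates(projects):
    """Return {project id: mean files changed per revision} for Java/SVN projects.

    Assumption: churn rate = average number of files changed per revision.
    """
    rates = {}
    for pid, language, vcs, files_per_revision in projects:
        # Keep only Java projects hosted on SVN that have at least one revision.
        if language == "Java" and vcs == "svn" and files_per_revision:
            rates[pid] = mean(files_per_revision)
    return rates


print(churn_rates(PROJECTS))
```

The filter and the per-project average are the whole analysis; the hard part on real repositories is obtaining and iterating over the revision data at scale.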
Solution in Java
Full program is over 70 lines of code.
Uses JSON and SVN libraries.
Runs sequentially.
Takes over 24 hours.
Takes almost 3 hours with data locally cached.
Can be parallelized, but doing so is very complex.
Solution in Boa
Simple program, 6 lines of code.
Hides implementation specifics.
Automatic parallelization; results in 1 minute.
Results can be easily reproduced by publishing these small programs with the data sets used.
Performance Results
Boa Architecture
Boa Architecture
Three main components
The Boa Language
Boa Compiler & Runtime
Supporting data infrastructure
The Boa Language
Domain-Specific Types
MapReduce Support
Quantifiers
User defined functions
Output Aggregators
Boa Language – Domain-Specific Types
Provides several domain-specific types which aid in abstracting mining software repository details (http://boa.cs.iastate.edu/docs/dsl-types.php)
Boa Language – MapReduce Support
Computations are specified via two user-defined functions:
Mapper – takes key-value pairs as input and produces key-value pairs as output.
Reducer – consumes the mapper's output and aggregates data based on individual keys.
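The mapper/reducer split can be illustrated with a small single-process Python sketch. This is not Boa or Hadoop code; run_mapreduce, mapper, reducer, and the RECORDS input are all hypothetical names and data chosen only to show how key-value pairs flow from the map phase to per-key aggregation.

```python
from collections import defaultdict


def mapper(key, value):
    """Take one (project id, language) pair and emit (language, 1)."""
    yield value, 1


def reducer(key, values):
    """Aggregate every value that was emitted under one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce sequentially in one process."""
    # Map phase plus grouping: collect mapper output under each output key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


# Hypothetical input: (project id, primary language).
RECORDS = [("p1", "Java"), ("p2", "C"), ("p3", "Java")]
print(run_mapreduce(RECORDS, mapper, reducer))
```

In a real MapReduce framework the mapper calls run in parallel across the cluster and the grouping step is a distributed shuffle; the sketch keeps only the data flow.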
Boa Language – Quantifiers
Boa defines three quantifiers: exists, foreach, and ifall.
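A rough Python analogy may help in reading the three quantifiers. It assumes the following semantics, which the slide does not spell out: exists runs its body once if some element satisfies the condition, foreach runs its body for every element that satisfies it, and ifall runs its body once only if all elements satisfy it. The function signatures and the file list are hypothetical; this is not Boa syntax.

```python
def exists(items, condition, body):
    """Run body once, with the first matching item, if any item matches."""
    for item in items:
        if condition(item):
            body(item)
            return


def foreach(items, condition, body):
    """Run body once for every item that matches."""
    for item in items:
        if condition(item):
            body(item)


def ifall(items, condition, body):
    """Run body once, with no argument, only if every item matches."""
    if all(condition(item) for item in items):
        body()


# Hypothetical file names from one revision.
files = ["A.java", "B.java", "README"]


def is_java(name):
    return name.endswith(".java")


log = []
exists(files, is_java, lambda f: log.append(("exists", f)))
foreach(files, is_java, lambda f: log.append(("foreach", f)))
ifall(files, is_java, lambda: log.append(("ifall",)))
print(log)
```

Here ifall records nothing because README does not match, while exists fires once and foreach fires for each of the two Java files.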
Boa Language – User-Defined Functions
Users can define their own mining algorithms
Facilitates code re-use.
Boa Language – Output aggregators
Output can be indexed
Output defined in terms of predefined data aggregators
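The idea of an indexed output backed by a predefined aggregator can be sketched in Python as follows. The Output class, its emit and result methods, and the aggregator names in AGGREGATORS are hypothetical stand-ins chosen for illustration; they are not Boa's actual aggregator set or syntax.

```python
from collections import defaultdict

# Illustrative aggregators only; the names are assumptions, not Boa's list.
AGGREGATORS = {
    "sum": sum,
    "max": max,
    "mean": lambda values: sum(values) / len(values),
}


class Output:
    """An output variable: values are emitted under an index, then aggregated per index."""

    def __init__(self, aggregator):
        self._aggregate = AGGREGATORS[aggregator]
        self._values = defaultdict(list)

    def emit(self, index, value):
        """Send one value to the aggregator under the given index."""
        self._values[index].append(value)

    def result(self):
        """Apply the aggregator to the values collected under each index."""
        return {index: self._aggregate(vals) for index, vals in self._values.items()}


# Hypothetical usage: count revisions per project id.
revisions = Output("sum")
for project_id in ["p1", "p2", "p1"]:
    revisions.emit(project_id, 1)
print(revisions.result())
```

The analysis code only emits values; choosing "sum", "max", or "mean" when declaring the output decides how they are combined, which is why the user never writes a reducer by hand in this style.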
Boa’s Supporting Infrastructure
Compiler & Runtime
Data Infrastructure
Web based interface
Boa’s Compiler & Runtime
Initial implementation was based upon the Sizzle compiler & framework
Sizzle is an open-source Java implementation of the Sawzall language.
Sizzle provides support for generating programs that run on the Hadoop open-source MapReduce framework.
Boa’s Data Infrastructure
Local cache of repository information.
First step – Locally replicate data.
Second step – Run the caching translator to convert data into the format required by the framework.
Input (JSON file + SVN repositories) -> Output (Hadoop SequenceFile)
Boa’s Web based Interface
Submit programs.
Compile & run them on their clusters.
Each submission creates a job in the system.
Evaluation
Programs were executed on a Hadoop 1.0.3 installation.
Cluster was not tuned for performance, except for setting the maximum number of map tasks for each compute node equal to the number of cores on that node and increasing the VM heap size.
Evaluation – Applicability
Research Question 1 – Does Boa help researchers analyze ultra-large-scale software repositories?
A set of 21 tasks in four different categories was examined:
Programming Languages
Project Management
Legal
Platform/Environment
Evaluation - Applicability
Evaluation - Scalability
Research Question 2 – Does the approach scale to the size of the cluster?
Research Question 3 – Does the approach scale with the size of the input?
Evaluation - Scalability
Evaluation - Scalability
Evaluation - Reproducibility
Research Question 4 – Using their infrastructure, can researchers easily reproduce previously published results?
Evaluation - Reproducibility
Conducted a controlled experiment.
Selected a group of 8 researchers.
Each chose 3 tasks.
References
http://design.cs.iastate.edu/papers/ICSE-13/icse13.pdf
http://boa.cs.iastate.edu/docs/