
BOA: A LANGUAGE AND INFRASTRUCTURE FOR ANALYZING ULTRA-LARGE-SCALE SOFTWARE REPOSITORIES

Presenter: Joshan V John

Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan & Tien N. Nguyen

Iowa State University, USA

Instructor: Christoph Csallner

Joshan Valayil John | The University of Texas at Arlington | CSE 6324


Agenda

Motivation

Ultra-large-scale software repositories

Barriers to mining software repositories

Solution - Boa

Goals of Boa

Boa Architecture

Evaluation

Motivation

The "Big-3" software repositories are known to host close to 1 million projects.

They contain a wealth of software and information about software.

Systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard.

Boa is a domain-specific language and infrastructure developed to ease the testing of hypotheses related to mining software repositories (MSR).


Ultra-large-scale Software Repositories


Why analyze software repositories?

Curiosity

Identify patterns

Forecasting

Plan for better designs

Empirical Validation


Barriers to mining software repositories

Develop programming expertise to access version control systems.

Establish infrastructure to store data downloaded from software repositories.

Develop an application to access this local data.

Improve scalability of the analysis infrastructure to process ultra-large-scale data.

Barriers to mining software repositories

Experiments are often irreproducible

Low reusability of experimental infrastructure

Lack of systematic curation leads to loss of experimental data.

Building analysis infrastructure to process ultra-large-scale data efficiently can be very hard.


Solution - Boa

Boa: a domain-specific language and infrastructure designed to analyze ultra-large-scale software repositories.


Goals of Boa

Easy to use

Better abstractions

Efficient & Scalable

Enhances reproducibility


A Research Question

Consider a program that answers:

“What are the churn rates for all Java projects that use SVN?”
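To make the question concrete, here is a minimal Python sketch of the computation on toy data. It assumes that churn rate means the average number of files changed per revision, which the slides do not define. The `Project` and `Revision` records and their field names are hypothetical stand-ins for illustration, not Boa's types.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Revision:
    files_changed: int  # number of files touched by this commit


@dataclass
class Project:
    name: str
    languages: list
    vcs: str  # e.g. "svn" or "git"
    revisions: list = field(default_factory=list)


def churn_rates(projects):
    """Map each Java/SVN project name to its mean files changed per revision."""
    rates = {}
    for p in projects:
        uses_java = any(lang.lower() == "java" for lang in p.languages)
        if not uses_java or p.vcs.lower() != "svn" or not p.revisions:
            continue  # skip projects outside the question's scope
        rates[p.name] = mean(r.files_changed for r in p.revisions)
    return rates


# Toy data (assumption): only "alpha" is both a Java project and on SVN.
projects = [
    Project("alpha", ["Java"], "svn", [Revision(2), Revision(4)]),
    Project("beta", ["Python"], "svn", [Revision(10)]),
    Project("gamma", ["Java", "C"], "git", [Revision(1)]),
]
print(churn_rates(projects))
```

Even this toy version has to spell out filtering, iteration, and aggregation by hand, and it does nothing about fetching repository data or running in parallel.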


Solution in Java

Full program is over 70 lines of code.

Uses JSON and SVN libraries.

Runs sequentially.

Takes over 24 hours.

Takes almost 3 hours with data locally cached.

Can be parallelized, but doing so is very complex.


Solution in Boa

Simple program, 6 lines of code.

Hides implementation specifics.

Auto parallelization, results in 1 minute.

Results can be easily reproduced by publishing these small programs with the data sets used.

Performance Results


Boa Architecture

Three main components

The Boa Language

Boa Compiler & Runtime

Supporting data infrastructure


The Boa Language

Domain-Specific Types

MapReduce Support

Quantifiers

User defined functions

Output Aggregators


Boa Language – Domain-Specific Types

Boa provides several domain-specific types that abstract away the details of mining software repositories (see http://boa.cs.iastate.edu/docs/dsl-types.php).


Boa Language – MapReduce Support

Computations are specified via two user-defined functions:

Mapper: takes key-value pairs as input and produces key-value pairs as output.

Reducer: consumes the mapper's output and aggregates data based on individual keys.
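The mapper and reducer roles can be illustrated with a small single-process Python sketch. It is not Hadoop or Boa code. The function names, the input records, and the choice of counting projects per language are all assumptions made for the example.

```python
from collections import defaultdict


def mapper(project_name, languages):
    """Emit one (language, 1) pair per language used by a project."""
    for lang in languages:
        yield lang.lower(), 1


def reducer(key, values):
    """Aggregate all values emitted for a single key."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, group by key (the shuffle step), then reduce, in one process."""
    grouped = defaultdict(list)
    for name, languages in records:
        for key, value in mapper(name, languages):
            grouped[key].append(value)
    return dict(reducer(k, vs) for k, vs in grouped.items())


# Hypothetical input: (project name, languages used).
records = [
    ("alpha", ["Java", "C"]),
    ("beta", ["Java"]),
    ("gamma", ["Python"]),
]
language_counts = run_mapreduce(records)
print(language_counts)
```

Because each mapper call looks at one record in isolation and each reducer call looks at one key in isolation, a real framework can spread both phases across many machines.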


Boa Language – Quantifiers

Boa defines three quantifiers: exists, foreach, and ifall.
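A rough Python analogy may help. It assumes the usual reading of these quantifiers: exists holds when at least one element matches a condition, foreach acts on every matching element, and ifall holds only when all elements match. The helper functions below are hypothetical and are not Boa syntax.

```python
def exists(items, predicate):
    """True if at least one item satisfies the predicate."""
    return any(predicate(x) for x in items)


def foreach(items, predicate, action):
    """Run the action on every item that satisfies the predicate."""
    for x in items:
        if predicate(x):
            action(x)


def ifall(items, predicate):
    """True only if every item satisfies the predicate."""
    return all(predicate(x) for x in items)


def is_java(lang):
    """Case-insensitive check for the Java language name."""
    return lang.lower() == "java"


languages = ["Java", "C", "java"]

matches = []
foreach(languages, is_java, matches.append)

print(exists(languages, is_java))
print(ifall(languages, is_java))
print(matches)
```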


Boa Language – User-Defined Functions

Users can define their own mining algorithms

Facilitates code re-use.


Boa Language – Output aggregators


Output can be indexed

Output defined in terms of predefined data aggregators
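The two points above can be sketched in Python as a table that receives values under an index and applies a predefined aggregator per index. The `OutputTable` class, its method names, and the three aggregators offered here are assumptions for illustration, not Boa's actual aggregator set or API.

```python
from collections import defaultdict


class OutputTable:
    """Collects emitted values per index and aggregates them on demand."""

    # Predefined aggregators the caller can choose from (illustrative subset).
    AGGREGATORS = {
        "sum": sum,
        "mean": lambda values: sum(values) / len(values),
        "max": max,
    }

    def __init__(self, kind):
        if kind not in self.AGGREGATORS:
            raise ValueError(f"unknown aggregator: {kind}")
        self._aggregate = self.AGGREGATORS[kind]
        self._values = defaultdict(list)

    def emit(self, index, value):
        """Send one value to the table under the given index."""
        self._values[index].append(value)

    def results(self):
        """Return the aggregated value for each index."""
        return {index: self._aggregate(vs) for index, vs in self._values.items()}


# Hypothetical usage: mean files changed per revision, indexed by project.
rates = OutputTable("mean")
rates.emit("alpha", 2)
rates.emit("alpha", 4)
rates.emit("beta", 10)
print(rates.results())
```

The analysis code only emits values. How they are combined is fixed by the aggregator chosen when the output is declared.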


Boa’s Supporting Infrastructure

Compiler & Runtime

Data Infrastructure

Web based interface


Boa’s Compiler & Runtime

Initial implementation was based upon the Sizzle compiler & framework

Sizzle is an open-source Java implementation of the Sawzall language.

Sizzle provides support for generating programs that run on the Hadoop open-source MapReduce framework.


Boa’s Data Infrastructure

Local cache of repository information.

First step: locally replicate the data.

Second step: run the caching translator to convert the data into the format required by the framework.

Input (JSON file + SVN repositories) -> Output (Hadoop SequenceFile)
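The translation step can be pictured with a simplified Python sketch that turns JSON project metadata into binary key-value records and reads them back. The length-prefixed layout here is a stand-in and is not Hadoop's SequenceFile format. The JSON field names (`id`, `name`) are assumptions, and the SVN side of the input is left out.

```python
import io
import json
import struct

HEADER = ">II"  # big-endian key length and value length


def write_records(stream, records):
    """Write (key, value) byte pairs as length-prefixed binary records."""
    for key, value in records:
        stream.write(struct.pack(HEADER, len(key), len(value)))
        stream.write(key)
        stream.write(value)


def read_records(stream):
    """Yield (key, value) byte pairs back from the stream."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = stream.read(header_size)
        if len(header) < header_size:
            break  # end of stream
        key_len, value_len = struct.unpack(HEADER, header)
        yield stream.read(key_len), stream.read(value_len)


def translate(json_text):
    """Convert project metadata (JSON) into binary key-value records."""
    projects = json.loads(json_text)
    return [
        (p["id"].encode("utf-8"), json.dumps(p, sort_keys=True).encode("utf-8"))
        for p in projects
    ]


# Hypothetical metadata for two projects.
metadata = '[{"id": "p1", "name": "alpha"}, {"id": "p2", "name": "beta"}]'
cache = io.BytesIO()
write_records(cache, translate(metadata))

cache.seek(0)
keys = [key.decode("utf-8") for key, _ in read_records(cache)]
print(keys)
```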


Boa’s Web based Interface

Users submit programs.

Programs are compiled and run on Boa's cluster.

Each submission creates a job in the system.


Evaluation

Programs were executed on a Hadoop 1.0.3 install.

Cluster was not tuned for performance, except for setting the maximum number of map tasks for each compute node equal to the number of cores on that node and increasing the VM heap size.
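As a sketch of what that tuning might look like, the Python helper below derives the two per-node settings. The slides do not say which properties were changed. `mapred.tasktracker.map.tasks.maximum` and `mapred.child.java.opts` are Hadoop 1.x property names assumed here to be the relevant ones, and the core count and heap size are made-up values.

```python
def map_task_settings(cores, heap_mb):
    """Build per-node settings: one map slot per core plus a larger task heap."""
    if cores < 1 or heap_mb < 1:
        raise ValueError("cores and heap_mb must be positive")
    return {
        # Assumed Hadoop 1.x property names; the slides do not name them.
        "mapred.tasktracker.map.tasks.maximum": str(cores),
        "mapred.child.java.opts": f"-Xmx{heap_mb}m",
    }


def to_property_xml(settings):
    """Render settings as <property> entries for a Hadoop-style XML config."""
    lines = []
    for name, value in settings.items():
        lines.append(
            f"<property><name>{name}</name><value>{value}</value></property>"
        )
    return "\n".join(lines)


# Hypothetical node: 8 cores, 2048 MB heap (neither value is from the slides).
node_settings = map_task_settings(cores=8, heap_mb=2048)
print(to_property_xml(node_settings))
```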


Evaluation – Applicability

Research Question 1 – Does Boa help researchers analyze ultra-large-scale software repositories?

A set of 21 tasks in four categories was examined:

Programming Languages

Project Management

Legal

Platform/Environment


Evaluation - Scalability

Research Question 2 – Does the approach scale to the size of the cluster?

Research Question 3 – Does the approach scale with the size of the input?


Evaluation - Reproducibility

Research Question 4 – Using their infrastructure, can researchers easily reproduce previously published results?


Evaluation - Reproducibility

Conducted a controlled experiment.

Selected a group of 8 researchers.

Each chose 3 tasks.


References

http://design.cs.iastate.edu/papers/ICSE-13/icse13.pdf

http://boa.cs.iastate.edu/docs/
