Using Natural Language Processing on large GitHub data


- John Alexander and Harshitha Chidananda

CS273: Data and Knowledge bases

UCSB, Fall 2016

Outline

● GitHub

● Knowledge base

● Data

● Goals

● Vector Space Model

● Topic Modeling

● Gitsmart Demo

● Conclusion

Our knowledge base

The aim of the project is to use existing data in large, well-developed GitHub repositories to build a knowledge base that can assist in the continued development of the project and in more quickly resolving issues posted to the repository.

Problem:

Currently, issues and pull requests for large projects are manually curated.

Proposed solution:

By providing a queryable, largely unsupervised means of tracking input from developers and users, we can significantly improve the efficiency of project curators.

Goals:

- Use Natural Language Processing to extract meaningful data from the text in the repository
- Find similar issues
- Recommend people who could work on issues in the repository, based on their previous work
- Draw relationships between issues, pull requests, and users

Data

Vector Space Modeling

Vector Space Model

- Also called the ‘term vector model’ or ‘vector processing model’
- Represents documents by term sets
- Compares global similarities between queries and documents; used in information filtering, information retrieval, indexing, and relevancy ranking
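As a quick illustration of how such a model compares a query with a document, the standard cosine similarity measure (the same measure used in step 3 below; the formula itself is not from the slides) is:

```latex
% Cosine similarity between a query term vector q and a document term vector d
% over a vocabulary of V terms.
\cos(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{V} q_i \, d_i}{\sqrt{\sum_{i=1}^{V} q_i^2} \, \sqrt{\sum_{i=1}^{V} d_i^2}}
```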

1. GitHub issues data from the GitHub API

Organization: Facebook

Repo: rocksdb

Specifically: Number, title and body

2. Preprocess data

Remove stop words

Remove punctuation

3. Vector Space Model

Vectorize

Cosine similarity

4. Similar issues found

Examples:

- “Support for range skips in compaction filter” / “Support for EventListeners in the Java API” / “Rocksdb compaction error”
- “rocksdb crashed at rocksdb::InternalStats::DumpCFStats” / “RocksDB shouldn’t determine at build time whether to use jemalloc / tcmalloc”
- “Option to expand range tombstones in db_bench” / “allow_os_buffer option” / “Collapse range deletions for faster range scans”
- “Range deletions unsupported in tailing iterator” / “Collapse range deletions for faster range scans”
- “EnvPosixTestWithParam should wait for all threads to finish” / “test failure: ./db_test2 - all threads in pthread_cond_wait”
- “Fix purging non-obsolete log files” / “With 4.x LOG files include frequent options Dump (leads to large LOG files)”
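Below is a minimal sketch of steps 1 through 4, assuming the public GitHub REST API issues endpoint and scikit-learn for vectorization. TF-IDF weighting and the unauthenticated, single-page request are illustrative choices, not necessarily what Gitsmart itself does.

```python
# Sketch of the similarity pipeline: fetch issues, preprocess, vectorize,
# and rank by cosine similarity. Assumes requests and scikit-learn are installed.
import re
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. GitHub issues data from the GitHub API: number, title and body.
resp = requests.get(
    "https://api.github.com/repos/facebook/rocksdb/issues",
    params={"state": "all", "per_page": 100},
)
issues = [(i["number"], i["title"], i.get("body") or "") for i in resp.json()]

# 2. Preprocess: lowercase and strip punctuation; stop words are dropped by the vectorizer.
def preprocess(text):
    return re.sub(r"[^\w\s]", " ", text.lower())

docs = [preprocess(title + " " + body) for _, title, body in issues]

# 3. Vector space model: TF-IDF term vectors and pairwise cosine similarity.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(vectors)

# 4. Similar issues found: for one issue, list its closest neighbours (excluding itself).
query_idx = 0
for j in sim[query_idx].argsort()[::-1][1:6]:
    print(issues[query_idx][1], "<->", issues[j][1], round(float(sim[query_idx][j]), 3))
```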

Topic Modeling

Topic Modeling

Topic modeling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body.

Latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003)

● Most popular form of topic modeling
● Views each document as a distribution over topics
● Views each topic as a distribution over words
● Over many iterations, the algorithm infers the most likely distributions
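In standard LDA notation (not taken from the slides), with θ_d the topic distribution of document d and φ_k the word distribution of topic k, those two distributions combine as:

```latex
% Probability of word w appearing in document d, marginalizing over K topics.
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d)
            = \sum_{k=1}^{K} \varphi_{k,w}\, \theta_{d,k}
```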

Our Approach

● Get the issues for a single GitHub repository
● Each issue and its comments is one document
● Run LDA to produce topics

○ Mallet library
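A minimal sketch of this step, using gensim's built-in LdaModel as a stand-in for the Mallet library named above. `issues` comes from the earlier sketch, and `comments_for` is a hypothetical helper returning the comment texts of an issue.

```python
# Sketch of topic modeling over a repository's issues with gensim's LdaModel
# (used here as a stand-in for Mallet). Each issue plus its comments is one document.
from gensim import corpora
from gensim.models import LdaModel

def tokenize(text):
    # Very simple tokenizer: lowercase, keep purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]

documents = [
    tokenize(title + " " + body + " " + " ".join(comments_for(number)))  # comments_for is hypothetical
    for number, title, body in issues
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Infer the topic and word distributions over many passes.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)
```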

User Matching

Assumptions:

● Users specialize in different areas within a repository

● These specializations are reflected in the topics

User Matching

● For each user:
○ Get the comments associated with the user
○ Apply the topic model to these comments
○ Receive a topic distribution for that user
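A sketch of building one topic vector per user, reusing `lda`, `dictionary`, and `tokenize` from the previous sketch. `comments_by_user` is a hypothetical mapping from a GitHub username to the list of comment texts that user has written.

```python
# Sketch: infer a dense topic distribution for each user from that user's comments.
import numpy as np

def topic_vector(texts):
    # Concatenate the texts into one bag of words and apply the trained topic model.
    bow = dictionary.doc2bow(tokenize(" ".join(texts)))
    vec = np.zeros(lda.num_topics)
    for topic_id, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = weight
    return vec

# comments_by_user is hypothetical: {username: [comment text, ...]}
user_vectors = {user: topic_vector(texts) for user, texts in comments_by_user.items()}
```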

User Matching

● For a query:
○ Apply the topic model to the query
○ Take the dot product of the query topic vector with each user topic vector
○ Multiply by log(#user issues)
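A sketch of the matching rule above, reusing `topic_vector` and `user_vectors` from the previous sketch. `issue_count_by_user` is a hypothetical count of issues each user has participated in; log(1 + n) is used here so a count of zero doesn't break the formula.

```python
# Sketch: score each user by (query topic vector . user topic vector) * log(#user issues).
import math
import numpy as np

def rank_users(query_text, top_n=5):
    q = topic_vector([query_text])  # apply the topic model to the query
    scores = {
        user: float(np.dot(q, vec)) * math.log(1 + issue_count_by_user.get(user, 0))
        for user, vec in user_vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(rank_users("compaction filter crashes when range deletions are present"))
```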

DEMO

Further Development

● Improve the stop-word list
● Add stemming to remove word ambiguity
● Improve weighting based on the total number of issues

Conclusion

- The GitHub API is very useful and has a lot of useful data
- There is no ground truth to compare with
- The application from the demo could be used to notify users about the issues they can solve
- Issues will get solved faster
- Users don't have to search for the issues they can work on

- Grouping similar issues, issue recommender

Thank You

Questions?

CS273: Data and Knowledge bases
