Using Natural Language Processing on large GitHub data


- John Alexander and Harshitha Chidananda

CS273: Data and Knowledge bases

UCSB, Fall 2016

Outline

● GitHub

● Knowledge base

● Data

● Goals

● Vector Space Model

● Topic Modeling

● Gitsmart Demo

● Conclusion

Our knowledge base

The aim of the project is to use existing data in large, well-developed GitHub repositories to build a knowledge base that can assist in the continued development of the project and in more quickly resolving issues posted to the repository.

Problem:

Currently, issues and pull requests for large projects are manually curated.

Proposed solution:

By providing a queryable, largely unsupervised means of tracking input from developers and users, we can significantly improve the efficiency of project curators.

Goals:

- Use Natural Language Processing to extract meaningful data from the text in the repository
- Find similar issues
- Recommend people who could work on issues in the repository, based on their previous work
- Draw relationships between issues, pull requests, and users

Data

Vector Space Modeling

Vector Space Model

- Also called the ‘term vector model’ or ‘vector processing model’
- Represents documents by term sets
- Compares global similarities between queries and documents; used in information filtering, information retrieval, indexing, and relevancy ranking
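As a quick illustration of how such a model compares a query with a document, the standard cosine similarity measure (the same measure used in step 3 below; the formula itself is not from the slides) is:

```latex
% Cosine similarity between a query term vector q and a document term vector d
% over a vocabulary of V terms.
\cos(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{V} q_i \, d_i}{\sqrt{\sum_{i=1}^{V} q_i^2} \, \sqrt{\sum_{i=1}^{V} d_i^2}}
```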

1. GitHub issues data from the GitHub API

Organization: Facebook

Repo: rocksdb

Specifically: Number, title and body

2. Preprocess data

Remove stop words

Remove punctuation

3. Vector Space Model

Vectorize

Cosine similarity

4. Similar issues found

Examples:

- “Support for range skips in compaction filter” / “Support for EventListeners in the Java API” / “Rocksdb compaction error”
- “rocksdb crashed at rocksdb::InternalStats::DumpCFStats” / “RocksDB shouldn’t determine at build time whether to use jemalloc / tcmalloc”
- “Option to expand range tombstones in db_bench” / “allow_os_buffer option” / “Collapse range deletions for faster range scans”
- “Range deletions unsupported in tailing iterator” / “Collapse range deletions for faster range scans”
- “EnvPosixTestWithParam should wait for all threads to finish” / “test failure: ./db_test2 - all threads in pthread_cond_wait”
- “Fix purging non-obsolete log files” / “With 4.x LOG files include frequent options Dump (leads to large LOG files)”
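Below is a minimal sketch of steps 1 through 4, assuming the public GitHub REST API issues endpoint and scikit-learn for vectorization. TF-IDF weighting and the unauthenticated, single-page request are illustrative choices, not necessarily what Gitsmart itself does.

```python
# Sketch of the similarity pipeline: fetch issues, preprocess, vectorize,
# and rank by cosine similarity. Assumes requests and scikit-learn are installed.
import re
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. GitHub issues data from the GitHub API: number, title and body.
resp = requests.get(
    "https://api.github.com/repos/facebook/rocksdb/issues",
    params={"state": "all", "per_page": 100},
)
issues = [(i["number"], i["title"], i.get("body") or "") for i in resp.json()]

# 2. Preprocess: lowercase and strip punctuation; stop words are dropped by the vectorizer.
def preprocess(text):
    return re.sub(r"[^\w\s]", " ", text.lower())

docs = [preprocess(title + " " + body) for _, title, body in issues]

# 3. Vector space model: TF-IDF term vectors and pairwise cosine similarity.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(vectors)

# 4. Similar issues found: for one issue, list its closest neighbours (excluding itself).
query_idx = 0
for j in sim[query_idx].argsort()[::-1][1:6]:
    print(issues[query_idx][1], "<->", issues[j][1], round(float(sim[query_idx][j]), 3))
```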

Topic Modeling

Topic Modeling

Topic modeling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body.

Latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003)

● Most popular form of topic modeling
● Views each document as a distribution over topics
● Views each topic as a distribution over words
● Over many iterations, the algorithm infers the most likely distributions
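In standard LDA notation (not taken from the slides), with θ_d the topic distribution of document d and φ_k the word distribution of topic k, those two distributions combine as:

```latex
% Probability of word w appearing in document d, marginalizing over K topics.
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d)
            = \sum_{k=1}^{K} \varphi_{k,w}\, \theta_{d,k}
```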

Our Approach

● Get the issues for a single GitHub repository
● Each issue and its comments is one document
● Run LDA to produce topics

○ Mallet library
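A minimal sketch of this step, using gensim's built-in LdaModel as a stand-in for the Mallet library named above. `issues` comes from the earlier sketch, and `comments_for` is a hypothetical helper returning the comment texts of an issue.

```python
# Sketch of topic modeling over a repository's issues with gensim's LdaModel
# (used here as a stand-in for Mallet). Each issue plus its comments is one document.
from gensim import corpora
from gensim.models import LdaModel

def tokenize(text):
    # Very simple tokenizer: lowercase, keep purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]

documents = [
    tokenize(title + " " + body + " " + " ".join(comments_for(number)))  # comments_for is hypothetical
    for number, title, body in issues
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Infer the topic and word distributions over many passes.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)
```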

User Matching

Assumptions:

● Users specialize in different areas within a repository

● These specializations are reflected in the topics

User Matching

● For each user:
○ Get the comments associated with the user
○ Apply the topic model to these comments
○ Receive a topic distribution for that user
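A sketch of building one topic vector per user, reusing `lda`, `dictionary`, and `tokenize` from the previous sketch. `comments_by_user` is a hypothetical mapping from a GitHub username to the list of comment texts that user has written.

```python
# Sketch: infer a dense topic distribution for each user from that user's comments.
import numpy as np

def topic_vector(texts):
    # Concatenate the texts into one bag of words and apply the trained topic model.
    bow = dictionary.doc2bow(tokenize(" ".join(texts)))
    vec = np.zeros(lda.num_topics)
    for topic_id, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = weight
    return vec

# comments_by_user is hypothetical: {username: [comment text, ...]}
user_vectors = {user: topic_vector(texts) for user, texts in comments_by_user.items()}
```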

User Matching

● For a query:
○ Apply the topic model to the query
○ Take the dot product of the query topic vector with each user topic vector
○ Multiply by log(#user issues)
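A sketch of the matching rule above, reusing `topic_vector` and `user_vectors` from the previous sketch. `issue_count_by_user` is a hypothetical count of issues each user has participated in; log(1 + n) is used here so a count of zero doesn't break the formula.

```python
# Sketch: score each user by (query topic vector . user topic vector) * log(#user issues).
import math
import numpy as np

def rank_users(query_text, top_n=5):
    q = topic_vector([query_text])  # apply the topic model to the query
    scores = {
        user: float(np.dot(q, vec)) * math.log(1 + issue_count_by_user.get(user, 0))
        for user, vec in user_vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(rank_users("compaction filter crashes when range deletions are present"))
```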

DEMO

Further Development

● Improve the stop-word list
● Add stemming to remove word ambiguity
● Improve weighting based on the total number of issues

Conclusion

- The GitHub API is very useful and has a lot of useful data
- There is no ground truth to compare with
- The application from the demo could be used to notify users about the issues they can solve
- Issues will get solved faster
- Users don't have to search for the issues they can work on

- Grouping similar issues, issue recommender

Thank You

Questions?

CS273: Data and Knowledge bases
