Analysing GitHub commits with R Barbara Fusinska @BasiaFusinska barbarafusinska.com

Page 1: Analysing GitHub commits with R

Analysing GitHub commits with R

Barbara Fusinska

@BasiaFusinska

barbarafusinska.com

Page 2: Analysing GitHub commits with R

About me

Programmer

Manager

Sweet tooth

Page 3: Analysing GitHub commits with R

Goals

Analysing GitHub data

Methods of data exploration & analysis

Learn R

Page 4: Analysing GitHub commits with R

Agenda

• Data analysis
• R basics
• Data exploration & processing
• Big Data
• Outside GitHub:
  – Data visualisation
  – Libraries & tools

Page 5: Analysing GitHub commits with R

Data analysis process

Raw Data

Processed Data

Data Analysis & Visualization

Exploratory Data Analysis

Page 6: Analysing GitHub commits with R

What do we want to find out?

Page 7: Analysing GitHub commits with R

GitHut Visualisation

http://githut.info/

Page 8: Analysing GitHub commits with R

GitHub Archive

https://www.githubarchive.org/

Page 9: Analysing GitHub commits with R

Google BigQuery

https://cloud.google.com/bigquery/what-is-bigquery

Page 10: Analysing GitHub commits with R

GitHub API

https://developer.github.com/v3/

Page 11: Analysing GitHub commits with R

DEMO: R basics

• RStudio
• Data types
• Data filtering
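As a minimal sketch of the data types and filtering topics above, the following base R snippet builds a small data frame and filters it. The repository names, languages and push counts are made up for illustration and are not taken from the talk's data.

```r
# Hypothetical sample data; the repository names and counts are made up.
repos <- data.frame(
  name     = c("alpha", "beta", "gamma", "delta"),
  language = c("R", "JavaScript", "R", "Python"),
  pushes   = c(12L, 40L, 3L, 25L),
  stringsAsFactors = FALSE
)

# Inspect the type (class) of each column.
column_types <- sapply(repos, class)

# Logical indexing: keep the rows where the condition is TRUE.
r_repos <- repos[repos$language == "R", ]

# subset() expresses the same kind of filter more compactly
# and can select columns at the same time.
busy_repos <- subset(repos, pushes > 10, select = c(name, pushes))
```

Running this in RStudio and printing `column_types`, `r_repos` and `busy_repos` shows how column classes and row filters behave on a data frame.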

Page 12: Analysing GitHub commits with R
Page 13: Analysing GitHub commits with R
Page 14: Analysing GitHub commits with R

DEMO: Active repositories

• What is an active repository?
• Where can we get data from?
• How to get useful information?
• How to handle missing data?
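One possible way to handle missing data, sketched in base R under the assumption that a missing language shows up either as `NA` or as an empty string. The `events` data frame and its values are hypothetical stand-ins for the event data used in the demo.

```r
# Hypothetical events; NA and "" both stand for "language unknown" here.
events <- data.frame(
  repo     = c("a/x", "b/y", "a/x", "c/z", "d/w"),
  language = c("R", NA, "R", "", "Go"),
  stringsAsFactors = FALSE
)

# Drop duplicate rows so that each repository is counted once.
repos <- unique(events)

# Flag rows whose language is missing or empty.
is_missing <- is.na(repos$language) | repos$language == ""

known   <- repos[!is_missing, ]   # language available
unknown <- repos[is_missing, ]    # candidates for a lookup elsewhere

# Count repositories per language, most common first.
language_counts <- sort(table(known$language), decreasing = TRUE)
```

Keeping the `unknown` rows separate, rather than discarding them, leaves the option of filling in the language from another data source later.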

Page 15: Analysing GitHub commits with R

GitHub Archive CreateEvent

Page 16: Analysing GitHub commits with R

GitHub Archive PushEvent

Page 17: Analysing GitHub commits with R

GitHub Archive PullRequestEvent

Page 18: Analysing GitHub commits with R

Expected result

Page 19: Analysing GitHub commits with R

Read Pull Requests

Page 20: Analysing GitHub commits with R

Summary Output

Page 21: Analysing GitHub commits with R

Unique results

Page 22: Analysing GitHub commits with R

Language information

Page 23: Analysing GitHub commits with R

Missing information

Page 24: Analysing GitHub commits with R

Filter missing

Page 25: Analysing GitHub commits with R

Plotting

Page 26: Analysing GitHub commits with R

Languages

Page 27: Analysing GitHub commits with R

DEMO: Processing various data sources

• Active repositories – from Create, Push and PullRequest events
• Missing language information:
  – Google BigQuery
  – GitHub API

Page 28: Analysing GitHub commits with R

Expected result

Page 29: Analysing GitHub commits with R

Google BigQuery

Page 30: Analysing GitHub commits with R

GitHub API from R

Page 31: Analysing GitHub commits with R

GitHub API Search

Page 32: Analysing GitHub commits with R

Combining data
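A minimal sketch of combining two data sources in base R, assuming both are data frames that share a `repo` column. The column names and values are hypothetical, and `merge()` is only one way to do this; the talk's own code may differ.

```r
# Hypothetical inputs: activity counts, plus languages from a second source.
activity <- data.frame(
  repo   = c("a/x", "b/y", "c/z"),
  events = c(10L, 4L, 7L),
  stringsAsFactors = FALSE
)
languages <- data.frame(
  repo     = c("a/x", "c/z"),
  language = c("R", "Go"),
  stringsAsFactors = FALSE
)

# Left join: keep every active repository, even if its language is unknown.
combined <- merge(activity, languages, by = "repo", all.x = TRUE)

# Sort by activity, most active first.
combined <- combined[order(combined$events, decreasing = TRUE), ]
```

With `all.x = TRUE`, repositories without a match keep their row and get `NA` as the language, so nothing is silently dropped.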

Page 33: Analysing GitHub commits with R

Sorted data

Page 34: Analysing GitHub commits with R

Languages

Page 35: Analysing GitHub commits with R

Big Data in R

• What’s Big Data anyway?
• R processes data in memory
• Bring down only the data you need
• Stream the data from a database
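Because R holds data in memory, one option is to read a large file in fixed-size chunks and keep only an aggregate. The sketch below assumes a file with one JSON event per line and uses a plain text match instead of a JSON parser; the file content and the helper `count_matching_lines` are hypothetical.

```r
# Write a small stand-in for an event file: one JSON event per line.
# The content is made up for illustration.
path <- tempfile(fileext = ".json")
writeLines(c(
  '{"type":"PushEvent","repo":"a/x"}',
  '{"type":"CreateEvent","repo":"b/y"}',
  '{"type":"PushEvent","repo":"c/z"}',
  '{"type":"PullRequestEvent","repo":"a/x"}',
  '{"type":"PushEvent","repo":"a/x"}'
), path)

# Count matching lines while holding at most `chunk_size` lines in memory.
count_matching_lines <- function(path, pattern, chunk_size = 2L) {
  con <- file(path, open = "r")
  on.exit(close(con))               # always release the connection
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break  # end of file reached
    total <- total + sum(grepl(pattern, lines, fixed = TRUE))
  }
  total
}

push_count <- count_matching_lines(path, '"type":"PushEvent"')
```

A real file would use a much larger `chunk_size`; the value 2 is only there so that the loop runs several times on the tiny sample.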

Page 36: Analysing GitHub commits with R

R, Hadoop, and How They Work Together

• Hadoop Streaming – utilities available as R scripts
• ORCH (Oracle R Connector for Hadoop) – provides access to a Hadoop cluster from R
• RHIPE (R and Hadoop Integrated Programming Environment) – techniques designed for analysing large sets of data
• RHadoop – R packages that allow users to manage and analyse data with Hadoop

https://blog.udemy.com/r-hadoop/

Page 37: Analysing GitHub commits with R

Azure Machine Learning Studio

http://channel9.msdn.com/Blogs/Windows-Azure/R-in-Azure-ML-Studio

Page 38: Analysing GitHub commits with R

DEMO: Processing multiple files

• Reading files line by line
• Reading multiple files
• Saving files
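The three steps above can be sketched in base R as follows. The directory layout, file names and file contents are hypothetical; each file here simply holds one event type per line.

```r
# Create two hypothetical hourly files in a temporary directory.
data_dir <- tempfile("archive")
dir.create(data_dir)
writeLines(c("PushEvent", "CreateEvent", "PushEvent"),
           file.path(data_dir, "hour-0.txt"))
writeLines(c("PushEvent", "PullRequestEvent"),
           file.path(data_dir, "hour-1.txt"))

# List the input files and count the PushEvent lines in each one.
files <- sort(list.files(data_dir, pattern = "\\.txt$", full.names = TRUE))
pushes <- vapply(files, function(f) {
  sum(readLines(f) == "PushEvent")
}, integer(1))

# Collect the per-file counts and save them as CSV.
summary_df <- data.frame(file = basename(files), pushes = unname(pushes),
                         stringsAsFactors = FALSE)
out_path <- file.path(data_dir, "push-summary.csv")
write.csv(summary_df, out_path, row.names = FALSE)
```

Saving the small summary instead of the raw lines means later steps, such as plotting, can start from the CSV without re-reading every input file.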

Page 39: Analysing GitHub commits with R

Analysing week

Page 40: Analysing GitHub commits with R

Read multiple files line by line

Page 41: Analysing GitHub commits with R

Push Events during the week

Page 42: Analysing GitHub commits with R

DEMO: Plotting in R

http://www.harding.edu/fmccown/r/

Page 43: Analysing GitHub commits with R

qplot - dots

Page 44: Analysing GitHub commits with R

qplot - lines

Page 45: Analysing GitHub commits with R

qplot - bars

Page 46: Analysing GitHub commits with R

Libraries & Tools

• Data Science
• JSON processing
• HTTP requests
• Summarising
• Plotting

Page 47: Analysing GitHub commits with R

DEMO: Downsampling

https://github.com/javiljoen/LTTB

Page 48: Analysing GitHub commits with R

DEMO: Anomaly detection

https://github.com/twitter/AnomalyDetection

Page 49: Analysing GitHub commits with R

To summarize…

• GitHub Archive, GitHub API
• R language – RStudio, types, data exploration, I/O operations, etc.
• Big Data in R
• 3rd party & built-in libraries
• Visualization

Page 50: Analysing GitHub commits with R

Thank you

Barbara Fusinska

https://github.com/BasiaFusinska/RTalk