When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems
Alexandre Decan, Tom Mens, Maelick Claes, Philippe Grosjean
Software Engineering Lab, University of Mons, Belgium
http://[email protected]/genlog/projects/ecos
SANER – Osaka, Japan, March 2016
Package-based Software Ecosystems
• A software ecosystem is:
– "a collection of software products that have some given degree of symbiotic relationships." [Messerschmitt & Szyperski 2003]
– "a collection of software projects that are developed and evolve together in the same environment." [Lungu 2008]
• A package-based ecosystem
– is composed of interdependent software packages…
– …that can be installed and distributed using a package manager
• Examples
The R ecosystem
• R = popular open source language and software environment for statistical computing
• Language popularity (IEEE Spectrum), rankings for 2015 and 2014
The R ecosystem
Where can we find R packages?
Repository               Number of packages (March 2015)
cran.r-project.org       6,411
www.bioconductor.org     997
r-forge.r-project.org    1,883
www.github.org           5,150
CRAN distribution platform
"The main advantage to getting your package on CRAN is that it will be easier for users to install. Your package will also be tested daily on multiple systems."
Karl Broman, "R package primer: a minimal tutorial", 2015
• But…
– "It used to be that putting your package on CRAN also gave it some exposure, but with >6000 packages, that's no longer quite true." [Broman 2015]
– "The non-transparent nature of the CRAN submission/rejection process is particularly at issue." https://github.com/stnava/ANTsR/issues/8
CRAN distribution platform
Package dependency graph of CRAN
6,706 unarchived packages on June 1, 2015
Credits: http://[email protected]/cran-gephi/
CRAN distribution platform
• Continuous integration process based on R CMD check to ensure overall consistency
– Daily check, on multiple OS, of all CRAN packages
– Looks for common problems in package structure, metadata, code, documentation, data, …
– Packages not passing the check need to be updated to avoid getting archived
– Packages can only depend on the latest version of another package → if they depend on a failed package they may also fail the check
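The last point, that a failure can spread to depending packages, can be illustrated with a small sketch. It is not CRAN's actual check procedure: the package names and the dependency graph are hypothetical, and the function only lists which packages are exposed to a failure through direct or transitive dependencies.

```python
from collections import defaultdict, deque


def packages_at_risk(depends_on, failed):
    """Return the packages that directly or transitively depend on a failed package.

    depends_on: dict mapping a package name to the list of packages it depends on.
    failed: set of package names that currently fail the check.
    """
    # Invert the graph: for each package, which packages depend on it?
    reverse = defaultdict(set)
    for pkg, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(pkg)

    at_risk = set()
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for dependant in reverse[current]:
            if dependant not in at_risk and dependant not in failed:
                at_risk.add(dependant)
                queue.append(dependant)
    return at_risk


# Hypothetical toy dependency graph (not real CRAN data).
depends_on = {
    "pkgA": ["pkgB"],
    "pkgB": ["pkgC"],
    "pkgC": [],
    "pkgD": ["pkgC", "pkgE"],
    "pkgE": [],
    "pkgF": ["pkgE"],
}

print(sorted(packages_at_risk(depends_on, {"pkgC"})))
```

A breadth-first walk over the reversed graph is used because the question is "who depends on the failed package", which is the opposite direction of the declared dependencies.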
CRAN distribution platform
Automated daily CRAN package check (2016-02-23): Summary of results
CRAN distribution platform
Automated daily CRAN package check (2016-02-23): Results by package
R on GitHub
githut.info, Q4 2014
How important is GitHub for R packages?
• CRAN used to be the main distribution platform for R packages
• With the appearance of GitHub, this is no longer the case
Is GitHub starting to replace or complement CRAN as package distribution medium?
How important is GitHub for R packages?
• GitHub hosts many R packages
• Increasing number of new R packages on GitHub
[Figure: monthly number of new packages; June 1, 2015]
How important is GitHub for R packages?
Is GitHub being used as a distribution platform for R packages?
• GitHub hosts thousands of R packages
• >40% of GitHub R packages are intended to be distributed, as they contain instructions to install them from GitHub.
• An increasing number of R packages on GitHub is not distributed elsewhere
• Packages like devtools facilitate distribution from GitHub
Managing R package dependencies
"The number of packages on CRAN and other repositories has increased beyond what might have been foreseen, and is revealing some limitations of the current design. One such problem is the general lack of dependency versioning in the infrastructure."
Jeroen Ooms, "Possible directions for improving dependency versioning in R," R Journal 5(1):197–206, June 2013
"One of the best things about the R ecosystem is being able to rely on other packages so that you don't have to write everything from scratch. But there is a hard balance to strike with keeping the dependency list small."
Jeff Leek, "How I decide when to trust an R package," simplystatistics.org/?p=4409, November 2015
Managing R package dependencies
In which repository are package dependencies satisfied?
depends on cmake, as well as some packages that depend on commercial packages not available in CRAN.
From a quantitative point of view, we found many R packages that are distributed on GitHub, providing specific installation instructions. We also found that packages that are only available on GitHub are younger than the GitHub packages that are also distributed on CRAN.
Conclusion. R package developers are using GitHub as a distribution platform for R packages.
V. RQ2: TO WHICH EXTENT DO R PACKAGES SUFFER FROM INTER-REPOSITORY DEPENDENCY PROBLEMS?
The R package devtools provides a wrapper around R's built-in installation manager to facilitate the installation of R packages from different sources. Unfortunately, repositories such as GitHub have no central listing of available R packages. GitHub contains millions of Git projects filled with content from various programming languages. Even if we limit GitHub projects to those tagged with the R language, the vast majority does not contain an R package.
The lack of a central listing of packages prevents devtools from automatically finding and installing dependencies. Additionally, the same package can be hosted on multiple repositories and in different versions, making the problem of dependency resolution even more difficult.
To analyze the extent to which R packages hosted on GitHub are subject to inter-repository dependency problems, we show that, while CRAN is the main source of packages to depend upon, the CRAN packages required by GitHub packages are the ones that are updated the most frequently. This shows that GitHub packages suffer from inter-repository problems, because there are many packages with error status on CRAN caused by backward incompatible changes in the dependencies. In order to achieve this goal we study the following subquestions:
• RQ2a: In which repository are package dependencies satisfied? We provide evidence that, while CRAN is the main source of packages to depend upon, many GitHub packages have dependencies to packages not provided by CRAN.
• RQ2b: Are CRAN packages more frequently updated than GitHub packages? We provide evidence that CRAN packages required by GitHub packages are more prone to be updated than CRAN packages not required by GitHub packages.
• RQ2c: How often do package updates cause backward incompatible changes? We provide evidence that many CRAN packages become broken due to CRAN package updates. We show how this problem is addressed on CRAN and that maintainers of GitHub packages do not benefit from CRAN's solution.
RQ2a: In which repository are package dependencies satisfied?
We determined in which repository the dependencies of R packages hosted in CRAN or GitHub are satisfied. The results are summarised in Fig. 5. For each edge from repository A to repository B, the blue label on top represents the percentage of R packages from A that have a dependency satisfied by B, and the red label below the edge represents the percentage of dependencies of R packages in A that are satisfied by B. For example, 70.6% of all R packages on GitHub depend on at least one package in CRAN, while 85.3% of all declared dependencies in R packages on GitHub are satisfied by CRAN packages. Note that, if a package is hosted by both CRAN and GitHub, we count it as a CRAN package, because for dependency satisfaction we privilege the officially distributed package over its development version.
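Edge (from → to)              % of packages (blue)   % of dependencies (red)
CRAN → CRAN                   61.5                   98.0
CRAN → Other repositories     1.8                    1.5
CRAN → GitHub                 0.8                    0.5
GitHub → CRAN                 70.6                   85.3
GitHub → GitHub               8.6                    4.0
GitHub → Other repositories   11.8                   10.7
Figure 5. Inter-repository package dependencies for CRAN and GitHub. In blue: percentage of packages. In red: percentage of dependencies.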
Because of the constraints imposed by CRAN's daily R CMD check, CRAN is self-contained. 98.0% of its dependencies are satisfied within CRAN. Another 1.5% is satisfied by BioConductor and Omegahat, which are package distributions that are taken into account by the R CMD check.
The presence of 0.8% CRAN packages that depend on GitHub packages (totalling 0.5% of CRAN's package dependencies) may seem to contradict this assertion, as the R CMD check does not allow the installation of packages from GitHub. However, these dependencies target packages that are distributed on BioConductor or Omegahat, and for which a development version is available on GitHub.
This implies that nearly all CRAN packages satisfy their dependencies within CRAN. In addition, 70.6% of the R packages on GitHub depend on a package belonging to CRAN (totalling 85.3% of GitHub's package dependencies). This strongly suggests that CRAN is at the center of the ecosystem, and that it is nearly impossible to install R packages hosted on GitHub without relying on CRAN.
Considering R packages on GitHub, 8.6% of them have a dependency in GitHub and 11.8% have a dependency that is not in CRAN or GitHub. The union of these R
Blue = % of packages in repository A that have a dependency satisfied by a package in repository B
Red = % of dependencies of packages in repository A that are satisfied by a package in repository B
Other repositories: BioConductor, OmegaHat, …
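The two edge labels are easy to confuse, so here is a minimal sketch of how both percentages could be computed for one source repository. The input structure and the toy data are assumptions made for illustration; they are not the study's data or tooling.

```python
from collections import Counter


def edge_percentages(packages):
    """Compute both edge labels for one source repository A.

    packages: dict mapping each package of A to a list naming, for every
    declared dependency, the repository that satisfies it (one entry per
    dependency; an empty list means the package has no dependencies).

    Returns {repo_B: (pct_of_packages, pct_of_dependencies)}.
    """
    n_packages = len(packages)
    n_deps = sum(len(targets) for targets in packages.values())

    pkg_count = Counter()  # packages of A with at least one dependency in B
    dep_count = Counter()  # dependencies of A satisfied by B
    for targets in packages.values():
        dep_count.update(targets)
        pkg_count.update(set(targets))

    return {
        repo: (100.0 * pkg_count[repo] / n_packages,
               100.0 * dep_count[repo] / n_deps)
        for repo in dep_count
    }


# Hypothetical GitHub-hosted packages and where their dependencies are satisfied.
github_packages = {
    "a": ["CRAN", "CRAN", "GitHub"],
    "b": ["CRAN"],
    "c": [],
    "d": ["Other"],
}

for repo, (pct_pkgs, pct_deps) in sorted(edge_percentages(github_packages).items()):
    print(f"GitHub -> {repo}: {pct_pkgs:.1f}% of packages, {pct_deps:.1f}% of dependencies")
```

Note that the package percentages need not sum to 100, since one package can have dependencies in several repositories, whereas the dependency percentages of one source repository do.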
CRAN is self-contained, so it does not suffer from inter-repository dependency problems. But what about GitHub?
Managing R package dependencies
Conclusions:
• Most packages on GitHub rely on packages in CRAN…
• …but the number of packages on GitHub that rely on packages outside of CRAN is increasing
Do R packages on GitHub suffer from inter-repository dependency problems?
Inter-repository dependency problems
Observations
• CRAN's policy requires packages to depend on the most recent package version
• GitHub packages frequently depend on CRAN packages
• They cannot benefit from CRAN's daily check
• But can use other continuous integration processes such as Travis CI
• Although fewer than 25% actually do this…
"I personally think it's REALLY relevant to at least be ABLE to be very specific and rigid with regard to your dependencies. And I think the R universe could provide better tools to fit the needs of developers and professionals out there in a better way."
→ CRAN package updates more likely to lead to problems for GitHub than for CRAN packages
Inter-repository dependency problems
CRAN package updates more likely to lead to problems for GitHub than for CRAN packages
Survival analysis: probability that a CRAN package is NOT updated (period 12/2014 – 6/2015; 3,740 considered packages)
If a CRAN package is required by a GitHub package, it is more likely to be updated.
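As a rough illustration of what such a survival curve measures, the sketch below uses the Kaplan-Meier estimator, one common choice for this kind of analysis. The slide does not state which estimator was used, and the durations are invented toy values, not the 3,740 packages of the study.

```python
def kaplan_meier(durations, observed):
    """Estimate the probability that a package has NOT yet been updated.

    durations: days until the package was updated, or until the observation
               period ended if no update was seen.
    observed:  True if an update was seen, False if the package was still
               not updated when the period ended (censored).

    Returns a list of (time, survival_probability) at each update time.
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Packages still under observation and not yet updated just before t.
        at_risk = sum(1 for d in durations if d >= t)
        # Packages updated exactly at time t.
        events = sum(1 for d, seen in zip(durations, observed) if seen and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: CRAN packages required by GitHub packages...
required = kaplan_meier([30, 60, 60, 90, 180], [True, True, True, True, False])
# ...and CRAN packages not required by any GitHub package.
not_required = kaplan_meier([60, 120, 180, 180, 180], [True, True, False, False, False])

print("required by GitHub:    ", required)
print("not required by GitHub:", not_required)
```

A curve that drops faster means packages in that group tend to be updated sooner, which is the pattern the slide reports for CRAN packages required by GitHub packages.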
Inter-repository dependency problems
How often do package updates cause backward incompatible changes on CRAN?
• 41% of the errors in CRAN packages are caused by backward incompatible changes (analysis based on daily R CMD check results over a 2-year period)
• We observed backward incompatible changes on average once every 20 updates (period December 2014 → June 2015)
Inter-repository dependency problems
How often do package updates cause backward incompatible changes?
R package maintainers share this concern:
"It is more and more of a pain if the package I'm depending on breaks"
"If I have to load a Depends package, it adds a significant burden: I have to check for conflicts every time I take a dependency on a new package."
"The risk of things breaking at some point due to the fact that a version of a dependency has changed without you knowing about it is immense."
"I had one case where my package heavily depended on another package and after a while that package was removed from CRAN and stopped being maintained."
Wrap-up
• GitHub is becoming increasingly used to develop and distribute R packages
• Package updates often lead to incompatibilities in depending packages
• GitHub packages cannot benefit from CRAN's continuous integration process
"One recent example was the forced roll-back of the ggplot2 update to version 0.9.0, because the introduced changes caused several other packages to break."
Jeroen Ooms, "Possible directions for improving dependency versioning in R," R Journal 5(1):197–206, June 2013
Wrap-up
• R community needs better tool support for managing dependencies and dealing with package versions.
• Such tools are starting to emerge:
– devtools, drat, packrat, miniCRAN, checkpoint, …
– R-Hub: ongoing project accepted by R Consortium
• Infrastructure for developing, building, testing, validation, distribution, continuous integration of R packages.
• A service that is complementary to CRAN and R-Forge
References
• On the maintainability of CRAN packages [CSMR-WCRE 2014]
• maintaineR, a web-based dashboard for maintainers of CRAN packages [ICSME 2015]
• An empirical study of identical function clones in CRAN [IWSC 2015 @ SANER]
• Anonymized e-mail interviews with R package maintainers active on CRAN and GitHub [Technical Report]
Website: [email protected]/genlog/projects/ecos/
• The Evolution of the R Software Ecosystem. Daniel M. German (University of Victoria), Bram Adams (École Polytechnique de Montréal), Ahmed E. Hassan (Queen's University) [CSMR 2013]