27
NO ONE KNOWS WHAT IT’S LIKE TO BE THE BAD MAN: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE Max Kuhn Director of Statistics at Pfizer R&D Zachary Deane–Mayer Cognius O P E N D A T A S C I E N C E C O N F E R E N C E_ BOSTON 2015 @opendatasci www.opendatascience.com

THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Embed Size (px)

Citation preview

Page 1: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

NO ONE KNOWS WHAT IT’S LIKE TO BE THE BAD MAN:

THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Max Kuhn Director of Statistics at Pfizer R&D

Zachary Deane–Mayer Cognius

O P E N D A T A S C I E N C E C O N F E R E N C E_ BOSTON 2015

@opendatasci

www.opendatascience.com

Page 2: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Nobody Knows What It’s Like To Be the Bad ManThe Development Process for the caret Package

Max KuhnPfizer Global R&D

[email protected]

Zachary Deane–MayerCognius

[email protected]

Page 3: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Model Function Consistency

Since there are many modeling packages in R written by differentpeople, there are inconsistencies in how models are specified andpredictions are created.

For example, many models have only one method of specifying themodel (e.g. formula method only)

> ## only one way here:

> rpart(y ~ ., data = dat)

>

> ## and both ways here:

> lda(y ~ ., data = dat)

>

> lda(x = predictors, y = outcome)

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 2 / 26

Page 4: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Generating Class Probabilities Using Different

Packages

Function predict Function SyntaxMASS::lda predict(obj) (no options needed)

stats:::glm predict(obj, type = "response")

gbm::gbm predict(obj, type = "response", n.trees)

mda::mda predict(obj, type = "posterior")

rpart::rpart predict(obj, type = "prob")

RWeka::Weka predict(obj, type = "probability")

caTools::LogitBoost predict(obj, type = "raw", nIter)

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 3 / 26

Page 5: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

The caret Package

The caret package was developed to:

create a unified interface for modeling and prediction (interfacesto 183 models)

streamline model tuning using resampling

provide a variety of “helper” functions and classes forday–to–day model building tasks

increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005, First version on CRAN: 10/2007

Website: http://topepo.github.io/caret/

JSS Paper: http://www.jstatsoft.org/v28/i05/paper

Model List: http://topepo.github.io/caret/bytag.html

Many computing sections in Applied Predictive Modeling

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 4 / 26

Page 6: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Easily Switching Between Models> library(doMC)

> registerDoMC(cores=10)

>

> ctlr <- trainControl(classProbs = TRUE, method = "repeatedcv")

> gbm_mod <- train(Class ~ ., data = training,

+ method = "gbm",

+ trControl = ctlr,

+ ## gbm argument:

+ verbose = FALSE)

>

> pls_mod <- train(Class ~ ., data = training,

+ method = "pls",

+ tuneLength = 10,

+ preProc = c("center", "scale", "spatailSign"),

+ trControl = ctlr)

>

> pls_search <- gafs(x = training[, -1], y = training$Class,

+ gafsControl = gafsControl(method = "cv", functions = rfGA),

+ ## train options:

+ method = "pls",

+ tuneLength = 10,

+ preProc = c("center", "scale", "spatailSign"),

+ trControl = ctlr)

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 5 / 26

Page 7: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Package Dependencies

One thing that makes caret different from most other packages isthat it uses code from an abnormally large number (> 80) of otherpackages.

A refresher:

Depends: required for package to function; loaded when caret

is loaded

Imports: required for package to function; not loaded

Suggests: the package uses it sometimes; not loaded

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 6 / 26

Page 8: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

“Simple’”” Example (38 Nodes)

caret

car

reshape2

foreach

plyr

nlme

BradleyTerry2

MASS

mgcv

nnet

pbkrtest

quantreg

stringr

Rcpp

codetools

iterators

lattice

brglmgtools

lme4

digest

gtable

scales

proto

profileModel

minqa

nloptr

Matrix

RcppEigen

SparseM

RColorBrewer

dichromat

munsell

labeling

stringi

magrittr

colorspace

ggplot2

...

...

...

...

...

ImportsDependsSuggestsEnhancesLinkingTo

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 7 / 26

Page 9: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Package Dependencies

Originally, these were in the Depends field of the DESCRIPTION filewhich caused all of them to be loaded with caret.

For many years, they were moved to Suggests, which solved thatissue.

However, their formal dependency in the DESCRIPTION file requiredCRAN to install hundreds of other packages to check caret. Themaintainers were not pleased.

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 8 / 26

Page 10: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Package Dependencies in caret Version 5.17-07

ada

arm

Boruta

bst

C50

car

caTools

class

Cubiste1071

earth

elasticnet

ellipse

evtreeextraTrees

fastICA

foba

gam

gbm

glmnet

hda

HDclassif

HiDimDA

Hmisc

ipred

kernlabkknn

klaR

kohonen

KRLS

lars

leaps

LogicForest

LogicReg

MASS

mboost

mda

mgcv

mlbench

neuralnet

nnet

nodeHarvest

obliqueRF

pamr

partDSA

party

penalized

penalizedLDA

pls

pROC

protoclass

proxy

qrnn

quantregForest

randomForest

RANN

relaxo

rFerns

rocc

rpart

rrcov

RRF

rrlda

RSNNS

RWeka

sda

sparseLDA

spls

stepPlr

superpc

cluster

foreach

lattice

plyrreshape2

Matrix

lme4

abind

coda

nlme

Rcpp

minqa

nloptr

RcppEigen

survival

partykit

pbkrtest

quantreg

SparseM

bitops

stringr

stringimagrittr

plotmo

plotrix

TeachingDemos

rJava

codetools

iterators

Formula

ggplot2

proto

scales

latticeExtra

acepack

foreign

gtable

gridExtra

digest

RColorBrewer

dichromat

munsell

labeling

colorspace

prodlim KernSmoothlava

numDeriv

igraph

combinat

CircStats

gtools

boot

stabs

nnls

quadprog

ROCR

gplots

gdata

mvtnorm

modeltools

strucchange

coin

zoo

sandwich

flsa

robustbasepcaPP

DEoptimR

mvoutlier

glasso

matrixcalc

sgeostat

robCompositions

GGally

sROC

reshape

RWekajars

entropy

corpcor

fdrtool

linprog

scalreg

lpSolve

effects

BradleyTerry2 brglm

profileModel

TH.data

evaluate

formatR

highr markdown

yaml

mime

polspline

multcomp

DBI

spam

maps

shapefiles

sp

maptools

ks

misc3d

rgl

multicool

brew

hdi

alr4

lmtestMatrixModels

survey

caret

xtable

testthat

akima

RUnit

knitr

chron

rms

mice

tables

scatterplot3d

som

biglm

fields

BayesX

VGAM

vcd

globaltest

Rmpi

microbenchmark

logcondens

doParallel

roxygen2

cba

mclust

crossval

itertools

optextras

qvcalc

relimp

xts

intervals

jsonlite

RCurl

R6

XML

gamlss.data

gamlss.dist

ucminf

BB

RcgminRvmmin

setRNG

dfoptim

svUnit

psychotools

timereg

RcppArmadillo

BH

timeDate

fit.models

network

gnm

assertthat

lazyeval

httpuv

htmltools

bdsmatrix

mratios

spacetime

FNN

deldir

tensor

polyclip

goftest

httr

memoise

whisker

rstudioapi

rversions

git2r

gamlss

LearnBayes

expm

PKPDmodels

MEMSS

mlmRev

optimx gamm4

inline

rbenchmark

highlight

pkgKitten

cmprsk

pmml

AER

psychotreetripack

logspline

nor1mix

dynlm

rpart.plot

tkrplot

tcltk2

R2wd

EBImage

png

mapproj

hexbin

graph

Rgraphviz

RGraphics

pixmap

metsgeepack

gof

ascii

igraphdata

ape

tseries

DAAG

fts

its

mondate

timeSeries

tis

robust

MPV

sfsmisc

catdata

intergraph

scagnostics

sna

tnet

poLCA

heplots

prefmod

dplyr

shiny

testit

coxme

SimComp

ISwR

RSQLite

rgdal

rgeos

gstat

spatstat

PBSmapping

rmarkdown

mitools

RODBCCompQuadForm

subselect

devtools

AGD

pan

Zelig

spdep

gpclib

VGAMdata

HSAUR

gclus

mix

care

binda

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 9 / 26

Page 11: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Package Dependencies

This problem was somewhat alleviated at the end of 2013 whencustom methods were expanded.

Although this functionality had already existed in the package forsome time, it was refactored to be more user friendly.

In the process, much of the modeling code was moved out of caret’sR files and into R objects, eliminating the formal dependencies.

Right now, the total number of dependencies is much smaller (2Depends, 7 Imports, and 25 Suggests).

This still affects testing though (described later). Also:

1 package is needed for this model and is not installed. (gbm).

Would you like to try to install it now?

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 10 / 26

Page 12: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

38 Dependencies for caret Version 6.0-47

●●

● ●

●●

●●●

●●

caret

lattice

ggplot2

car reshape2

foreach

plyrnlme

BradleyTerry2

lme4

brglm

gtools

MASS

mgcv

nnet

pbkrtest

quantreg

codetools

iterators

digest

gtablescales

proto

Rcpp

stringr

profileModel

Matrix

minqa

nloptr

RcppEigen

SparseM

RColorBrewer

dichromat

munsell

labeling

stringi

magrittr

colorspace

plotrix

TeachingDemos

survival

KernSmooth

lava

numDeriv

modeltools

mvtnorm

zoo

sandwich

mime

qvcalc

relimp

quadprog

optextras

bitops

class

plotmo

rpart

prodlim

combinat

coin

strucchange pls

cluster

latticeExtra

acepack

foreign

gridExtra

Formula

maps

sp

TH.data

evaluateformatR

highr

markdown

yaml

effects

gnm

ucminf

BB

RcgminRvmmin

setRNG

dfoptim

svUnit

gdata

caTools

lmtest

e1071

earthfastICA

gam

ipred

kernlab

klaR

ellipse

mda

mlbench

party

pROC

proxy

randomForest

RANN

spls

subselect

pamrsuperpc

Cubist

testthat

Hmisc

mapproj

hexbin

maptools

multcomp

knitr

alr4

bootleaps

MatrixModelsrgl

survey

abind

doParallel

itertools

prefmod

PKPDmodels

MEMSSmlmRev

optimx

gamm4

gplots

tripack

akima

logspline

nor1mix

dynlm

RUnit

graph

Rgraphviz

inline

rbenchmark

highlightpkgKittenexpm

vcd

...

...

...

...

...

ImportsDependsSuggestsEnhancesLinkingTo

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 11 / 26

Page 13: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

The Basic Release Process

1 create a few dynamic man pages

2 use R CMD check --as-cran to ensure passing CRAN testsand unit tests

3 update all packages (and R)

4 run regression tests and evaluate results

5 send to CRAN

6 repeat

7 repeat

8 install passed caret version

9 generate HTML documentation and sync github io branch

10 profit!

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 12 / 26

Page 14: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Toolbox

RStudio projects: a clean, self contained package developmentenvironment

I No more setwd(’/path/to/some/folder’) in scriptsI Keep track of project-wide standards, e.g. code formattingI An RStudio project was the first thing we added after moving

the caret repository to github

devtools: automate boring tasks

testthat: automated unit testing

roxygen2: combine source code with documentation

github: source control

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 13 / 26

Page 15: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

devtools

devtools::install builds package and installs it locally

devtools::check:

1 Builds documentation

2 Runs unit tests

3 Builds tarball

4 Runs R CMD CHECK

devtools::release builds package and submits it to CRAN

devtools::install github enables non-CRAN code distributionor distribution of private packages

devtools::use travis enables automated unit testing throughtravis-CI and test coverage reports through coveralls

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 14 / 26

Page 16: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Testing

Testing for the package occurs in a few different ways:

units tests via testthat and travis-CI

regression tests for consistency

Automated unit testing via testthat

devtools::use testhat

Unit tests prevent new features from breaking old code

All functions should have associated tests

Run during R CMD check --as-cran

Can specify that certain tests be skipped on CRAN

caret is slowly adding more testthat tests

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 15 / 26

Page 17: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

github + travis + coveralls

Travis and coveralls are tools for automated unit testing

Travis reports test failures and CRAN errors/warnings

Coveralls reports % of code covered by unit tests

Both automatically comment on the PR itself

Contributor submits code via a pull request

Travis notifies them of test failures

Coveralls notifies them to write tests for new functions

Automated feedback on code quality allows rapid iteration

Code review once unit tests and R CMD CHECK pass

github supports line-by-line comments

Usually several more iterations here

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 16 / 26

Page 18: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Writing new unit tests

Coveralls identifies un-tested code:

Developer chooses a file with low or no coverage

Developer writes unit tests

context('Test Data Splitting Functions')test_that('createTimeSlices', {

y <- 1:10

s1 <- createTimeSlices(y, initialWindow=5, horizon=1)

expect_equal(length(s1$train), 5)

expect_equal(s1$train$Training1, 1:5)

expect_equal(s1$test$Testing1, 6)

})

while(coverage<100%){repeat}

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 17 / 26

Page 19: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Regression Testing

Prior to CRAN release (or whenever required), a comprehensive set ofregression tests can be conducted.

All modeling packages are updated to their current CRAN versions.

For each model accessed by train, rfe, and/or sbf, a set of testcases are computed with the production version of caret and thedevel version.

First, test cases are evaluated to make sure that nothing has beenbroken by updated versions of the consistuent packages.

Diffs of the model results are computed to assess any differences incaret versions.

This process takes approximately 3hrs to complete using make -j 12

on a Mac Pro.

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 18 / 26

Page 20: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Regression Testing

Typical tests include:

tests for different resampling methods (LOOCV!)

formula vs. non-formula interface

predictions

variable importance

ancillary classes/functions (e.g. predictors)

In some cases, we need to correlate results between versions due torandom numbers without seed control.

I send a lot of emails/pull requests to package maintainers (e.g. classprobabilities don’t sum to 1, predictions fail with n = 1, etc. )

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 19 / 26

Page 21: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Regression Testing

$ R CMD BATCH move_files.R

$ cd ~/tmp/2015_04_19_09__6.0-41/

$ make -j 12 -i

2015-04-19 09:13:44: Starting ada

2015-04-19 09:13:44: Starting AdaBag

2015-04-19 09:13:44: Starting AdaBoost.M1

2015-04-19 09:13:44: Starting ANFIS

:

make: [FH.GBML.RData] Error 1 (ignored)

:

2015-04-19 12:03:52: Finished WM

2015-04-19 12:04:48: Finished xyf

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 20 / 26

Page 22: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Documentation

caret originally contained four package vignettes with in–depthdescriptions of functionality with examples.

However, this added time to R CMD check and was a general pain forCRAN.

Efforts to make the vignettes more computationally efficient (e.g.reducing the number of examples, resamples, etc.) diminished theeffectiveness of the documentation.

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 21 / 26

Page 23: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Documentation

The documentation was moved out of the package and to the githubIO page.

These pages are built using knitr whenever a new version is sent toCRAN. Some advantages are:

longer and more relevant examples are available

update schedule is under my control

dynamic documentation (e.g. D3 network graphs, JS tables)

better formatting

It currently takes about 4hr to create these (using parallel processingwhen possible).

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 22 / 26

Page 24: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Required “Optimizations” for CRAN

For example, there is one check that produces a large number of falsepositive warnings. For example:

> bwplot.diff.resamples <- function (x, data, metric = x$metric, ...) {+ ## some code

+ plotData <- subset(plotData, Metric %in% metric)

+ ## more code

+ }

will trigger a warning that “bwplot.diff.resamples: no visible

binding for global variable ’Metric’”.

The “solution” is to have a file that is sourced first in the package(e.g. aaa.R) with the line

> Metric <- NULL

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 23 / 26

Page 25: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Judging the Severity of Problems

It’s hard to tell which warnings should be ignored and which shouldnot. There is also the issue of inconsistencies related to who is “onduty” when you submit your package.

It is hard to believe that someone took the time for this:

Description Field: "My awesome R package"

R CMD check: "Malformed Description field: should contain one

or more complete sentences."

Description Field: "This package is awesome"

R CMD check: "The Description field should not start with the

package name, ’This package’ or similar."

Description Field: "Blah blah blah."

R CMD check: PASS

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 24 / 26

Page 26: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

Backup Slides

Page 27: THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE

roxygen2

Simplified package documentation

Automates many parts of the documentation process

Special comment block above each function

Name, description, arguments, etc.

Code and documentation are in the same source file

A must have for new packages but hard to convert existing packages

caret has 92 .Rd files

I’m not in a hurry to re-write them all in roxygen2 format

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 26 / 26