44
The caret Package: A Unified Interface for Predictive Models Max Kuhn Pfizer Global R&D Nonclinical Statistics Groton, CT max.kuhn@pfizer.com May 12, 2011

The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

The caret Package: A Unified Interface for PredictiveModels

Max Kuhn

Pfizer Global R&DNonclinical Statistics

Groton, [email protected]

May 12, 2011

Page 2: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Shameless Plug # 1: Courses

I’ll be teaching 2 R classes here for Predictive Analytics World.

R Bootcamp (October 16)

R for Predictive Modeling: A Hands-On Introduction (October 17)

http://www.predictiveanalyticsworld.com/newyork/2011/

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 2 / 44

Page 3: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Motivation

Theorem (No Free Lunch)

In the absence of any knowledge about the prediction problem, no modelcan be said to be uniformly better than any other

Given this, it makes sense to use a variety of different models to find onethat best fits the data

R has many packages for predictive modeling (aka machine learning)(akapattern recognition) . . .

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 3 / 44

Page 4: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Model Function ConsistencySince there are many modeling packages written by different people, thereare some inconsistencies in how models are specified and predictions aremade.

For example, many models have only one method of specifying the model(e.g. formula method only)

The table below shows the syntax to get probability estimates from severalclassification models:

obj Class Package predict Function Syntaxlda MASS predict(obj) (no options needed)glm stats predict(obj, type = "response")

gbm gbm predict(obj, type = "response", n.trees)

mda mda predict(obj, type = "posterior")

rpart rpart predict(obj, type = "prob")

Weka RWeka predict(obj, type = "probability")

LogitBoost caTools predict(obj, type = "raw", nIter)

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 4 / 44

Page 5: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

The caret Package

The caret package was developed to:

create a unified interface for modeling and prediction

streamline model tuning using resampling

provide a variety of “helper” functions and classes for day–to–daymodel building tasks

increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005

First version on CRAN: 10/2007

Website: http://caret.r-forge.r-project.org

JSS Paper: www.jstatsoft.org/v28/i05/paper

4 package vignettes (82 pages total)

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 5 / 44

Page 6: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Example Data: TunedIT Music Challenge

http://tunedit.org/challenge/music-retrieval/genres

Using 191 descriptors, classify 12495 musical segments into one of 6genres: Blues, Classical, Jazz, Metal, Pop, Rock.

Use these data to predict a large test set of music segments.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 6 / 44

Page 7: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Example Data: TunedIT Music Challenge

The predictors and class variables are contained in a data frame calledmusic.

> head(music[,1:5])

TC SC SC_V ASE1 ASE2

1 2.5788 481.45 76989.0 -0.12334 -0.11578

2 2.7195 1405.30 825380.0 -0.17655 -0.18323

3 2.5351 601.09 686240.0 -0.13940 -0.13251

4 2.4465 637.73 122580.0 -0.14995 -0.14802

5 2.5657 776.86 124010.0 -0.16863 -0.16112

6 2.7737 447.09 8531.9 -0.16128 -0.15742

> head(music$GENRE)

[1] Pop Blues Pop Jazz Jazz Classical

Levels: Blues Classical Jazz Metal Pop Rock

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 7 / 44

Page 8: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Data Splitting

createDataPartition conducts stratified random splits

> ## Create a test set with 25% of the data

> set.seed(1)

> inTrain <- createDataPartition(music$GENRE, p = .75)[[1]]

> length(inTrain)

[1] 9373

> head(inTrain)

[1] 2 7 14 20 22 47

This produces a list for each resample. The list elements are integers forthe resampled set.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 8 / 44

Page 9: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Data Splitting

> trainDescr <- music[ inTrain, -ncol(music)]

> testDescr <- music[-inTrain, -ncol(music)]

> trainClass <- music$GENRE[ inTrain]

> testClass <- music$GENRE[-inTrain]

> prop.table(table(music$GENRE))

Blues Classical Jazz Metal Pop Rock

0.12773109 0.27563025 0.24033613 0.07394958 0.12605042 0.15630252

> prop.table(table(trainClass))

trainClass

Blues Classical Jazz Metal Pop Rock

0.12770724 0.27557879 0.24037128 0.07393577 0.12610690 0.15630001

Other functions: createFolds, createMultiFolds, createResamples

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 9 / 44

Page 10: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Data Pre–Processing Methods

preProcess calculates values that can be used to apply to any data set(e.g. training, set, unknowns).

Current methods: centering, scaling, spatial sign transformation, PCA orICA “signal extraction”, imputation (via bagging or k–nearest neighbors),Box–Cox transformations

> ## Determine means and sd's> procValues <- preProcess(trainDescr, method = c("center", "scale"))

> procValues

> ## Use the predict methods to do the adjustments

> trainScaled <- predict(procValues, trainDescr)

> testScaled <- predict(procValues, testDescr)

preProcess can also be called within other functions, such as train, foreach resampling iteration.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 10 / 44

Page 11: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Model Tuning Using Resampling

Define sets of model parameter values to evaluate;for each parameter set do

for each resampling iteration doHold–out specific samples ;Fit the model on the remainder;Predict the hold–out samples;

endCalculate the average performance across hold–out predictions

endDetermine the optimal parameter set;

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 11 / 44

Page 12: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Model Tuning

train uses resampling to tune and/or evaluate candidate models.

> set.seed(1)

> rbfSVM <- train(x = trainDescr, y = trainClass,

+ method = "svmRadial",

+ ## center and scale

+ preProc = c("center", "scale"),

+ ## Length of default tuning parameter grid

+ tuneLength = 8,

+ ## Repeated cross-validation resampling

+ trControl = trainControl(method = "repeatedcv",

+ repeats = 5),

+ ## Pick the best model using resampled Kappa

+ metric = "Kappa",

+ ## Pass arguments to ksvm

+ fit = FALSE)

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 12 / 44

Page 13: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Model Tuning> print(rbfSVM, printCall = FALSE)

9373 samples

191 predictors

6 classes: 'Blues', 'Classical', 'Jazz', 'Metal', 'Pop', 'Rock'

Pre-processing: centered, scaled

Resampling: Cross-Validation (10 fold, repeated 5 times)

Summary of sample sizes: 8437, 8435, 8434, 8435, 8437, 8436, ...

Resampling results across tuning parameters:

C Accuracy Kappa Accuracy SD Kappa SD

0.25 0.916 0.895 0.00953 0.0119

0.5 0.938 0.923 0.00824 0.0103

1 0.956 0.945 0.00641 0.008

2 0.964 0.955 0.00614 0.00766

4 0.968 0.961 0.0061 0.00761

8 0.969 0.962 0.00623 0.00777

16 0.969 0.962 0.00633 0.0079

32 0.969 0.962 0.0063 0.00786

Tuning parameter 'sigma' was held constant at a value of 0.00518

Kappa was used to select the optimal model using the largest value.

The final values used for the model were C = 16 and sigma = 0.00518.Max Kuhn (Pfizer Global R&D) caret May 12, 2011 13 / 44

Page 14: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Model Tuning

> class(rbfSVM)

[1] "train"

> class(rbfSVM$finalModel)

[1] "ksvm"

attr(,"package")

[1] "kernlab"

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 14 / 44

Page 15: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Model Tuning

train uses as many “tricks” as possible to reduce the number ofmodels fits (e.g. using sub–models). Here, it uses the kernlab

function sigest to analytically estimate the RBF scale parameter.

Currently, there are options for 110 models (see ?train for a list)

Allows user–defined search grid, performance metrics and selectionrules

Easily integrates with any parallel processing framework that canemulate lapply

Formula and non–formula interfaces

Methods: predict, print, plot, varImp, resamples, xyplot,densityplot, histogram, stripplot, . . .

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 15 / 44

Page 16: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Plotsplot(rbfSVM, xTrans = function(x) log2(x))

Cost

Acc

ura

cy (

Re

pe

ate

d C

ross

−V

alid

atio

n)

0.92

0.93

0.94

0.95

0.96

0.97

−2 0 2 4

● ● ● ●

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 16 / 44

Page 17: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Plotsdensityplot(rbfSVM, metric = "Kappa", pch = "|")

Kappa

De

nsi

ty

0

10

20

30

40

0.94 0.96 0.98

| ||| || | |||| | |||| || | || || || | || |||| | ||| || || ||| || | ||||

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 17 / 44

Page 18: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Prediction and Performance Assessment

The predict method can be used to get results for other data sets:

> svmPred <- predict(rbfSVM, testDescr)

> str(svmPred)

Factor w/ 6 levels "Blues","Classical",..: 3 2 6 3 5 6 5 1 2 6 ...

> svmProbs <- predict(rbfSVM, testDescr, type = "prob")

> head(svmProbs)

Blues Classical Jazz Metal Pop Rock

1 0.03109657 0.51176742 0.31534778 0.05645315 0.02457019 0.06076489

2 0.00000000 0.98948148 0.01051852 0.00000000 0.00000000 0.00000000

3 0.05631158 0.03418600 0.07429845 0.14161385 0.20161666 0.49197345

4 0.09363752 0.15474426 0.32233519 0.14328794 0.14338776 0.14260733

5 0.07702743 0.09083003 0.16349012 0.17140600 0.23710395 0.26014248

6 0.06928080 0.03574326 0.08477684 0.13890564 0.18578909 0.48550437

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 18 / 44

Page 19: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Prediction and Performance Assessment

> confusionMatrix(svmPred, testClass)

Confusion Matrix and Statistics

Reference

Prediction Blues Classical Jazz Metal Pop Rock

Blues 395 0 0 3 1 1

Classical 0 841 21 0 1 2

Jazz 4 20 724 9 4 8

Metal 0 0 0 214 2 0

Pop 0 0 0 3 378 6

Rock 0 0 5 2 7 471

Overall Statistics

Accuracy : 0.9683

95% CI : (0.9615, 0.9742)

No Information Rate : 0.2758

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9605

Mcnemar's Test P-Value : NA

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 19 / 44

Page 20: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Prediction and Performance Assessment

Statistics by Class:

Class: Blues Class: Classical Class: Jazz Class: Metal Class: Pop Class: Rock

Sensitivity 0.9900 0.9768 0.9653 0.92641 0.9618 0.9652

Specificity 0.9982 0.9894 0.9810 0.99931 0.9967 0.9947

Pos Pred Value 0.9875 0.9723 0.9415 0.99074 0.9767 0.9711

Neg Pred Value 0.9985 0.9911 0.9890 0.99415 0.9945 0.9936

Prevalence 0.1278 0.2758 0.2402 0.07399 0.1259 0.1563

Detection Rate 0.1265 0.2694 0.2319 0.06855 0.1211 0.1509

Detection Prevalence 0.1281 0.2771 0.2463 0.06919 0.1240 0.1553

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 20 / 44

Page 21: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Comparing Models

We can use the resampling results to make formal and informalcomparisons between models.

Based on the work of

Hothorn et al. “The design and analysis of benchmark experiments”.Journal of Computational and Graphical Statistics (2005) vol. 14 (3)pp. 675-699

Eugster et al. “Exploratory and inferential analysis of benchmarkexperiments”. Ludwigs-Maximilians-Universitat Munchen, Departmentof Statistics, Tech. Rep (2008) vol. 30

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 21 / 44

Page 22: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Comparing Models

> set.seed(1)

> rfFit <- train(x = trainDescr, y = trainClass,

+ method = "rf", tuneLength = 5,

+ trControl = trainControl(method = "repeatedcv",

+ repeats = 5,

+ verboseIter = FALSE),

+ metric = "Kappa")

> set.seed(1)

> plsFit <- train(x = trainDescr, y = trainClass,

+ method = "pls", tuneLength = 20,

+ preProc = c("center", "scale", "BoxCox"),

+ trControl = trainControl(method = "repeatedcv",

+ repeats = 5,

+ verboseIter = FALSE),

+ metric = "Kappa")

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 22 / 44

Page 23: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Comparing Models

> resamps <- resamples(list(rf = rfFit, pls = plsFit, svm = rbfSVM))

> print(summary(resamps))

Call:

summary.resamples(object = resamps)

Models: rf, pls, svm

Number of resamples: 50

Accuracy

Min. 1st Qu. Median Mean 3rd Qu. Max.

rf 0.9200 0.9328 0.9370 0.9370 0.9424 0.9499

pls 0.8348 0.8488 0.8554 0.8554 0.8631 0.8806

svm 0.9478 0.9648 0.9691 0.9694 0.9752 0.9819

Kappa

Min. 1st Qu. Median Mean 3rd Qu. Max.

rf 0.9003 0.9162 0.9215 0.9215 0.9282 0.9376

pls 0.7932 0.8106 0.8192 0.8190 0.8286 0.8507

svm 0.9350 0.9561 0.9615 0.9619 0.9691 0.9774

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 23 / 44

Page 24: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Comparing Models

> diffs <- diff(resamps, metric = "Kappa")

> print(summary(diffs))

Call:

summary.diff.resamples(object = diffs)

p-value adjustment: bonferroni

Upper diagonal: estimates of the difference

Lower diagonal: p-value for H0: difference = 0

Kappa

rf pls svm

rf 0.10245 -0.04043

pls < 2.2e-16 -0.14288

svm < 2.2e-16 < 2.2e-16

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 24 / 44

Page 25: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Parallel Coordinate Plotsparallel(resamps, metric = "Kappa")

Kappa

pls

rf

svm

0.8 0.85 0.9 0.95

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 25 / 44

Page 26: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Box Plotsbwplot(resamps, metric = "Kappa")

pls

rf

svm

0.80 0.85 0.90 0.95

●●

Kappa

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 26 / 44

Page 27: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Dot Plots of Average Differencesdotplot(diffs)

Difference in Kappa Confidence Level 0.983 (multiplicity adjusted)

pls − svm

rf − pls

rf − svm

−0.15 −0.10 −0.05 0.00 0.05 0.10

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 27 / 44

Page 28: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Feature Selection

There are many predictive models with built–in feature selection (e.g.trees, the lasso, MARS, etc).

caret contains a few functions for supervised feature selection via“wrappers”.

Two wrappers techniques in caret are::

recursive feature selection (RFE)

filtering using simple, univariate statistics

This can be tricky and can be fraught with bias.

See: Ambroise and McLachlan (2002) for an example

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 28 / 44

Page 29: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Recursive Feature Selection

This is basically backwards selection.

We rank the predictors by importance, then cull the least important.

We create a performance profile across the subset size and pick the best

The final model is refit using only the subset.

The feature selection step must be cross–validated!

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 29 / 44

Page 30: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Recursive Feature Elimination

for Each Resampling Iteration doPartition original data into training and hold–back sets via resampling ;Train the model on the training set using all predictors;Predict the held–back samples;Calculate variable importance or rankings;for Each subset size Si , i = 1 . . .S do

Keep the Si most important variables;Train the model on the training set using Si predictors;Predict the held–back samples;

end

endCalculate the performance profile over the Si using the held–back samples;Determine the appropriate number of predictors;Estimate the final list of predictors to keep in the final model;Fit the final model based on the optimal Si using the original data set;

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 30 / 44

Page 31: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Recursive Feature Selection

The rfe function is a framework for doing this. There are severalpre–defined functions for certain models (and a wrapper for train)

For each subset, let’s run a few regression models for illustration

> data(BloodBrain)

> varSizes <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 80, 80)

> x <- bbbDescr[,-nearZeroVar(bbbDescr)]

> x <- x[, -findCorrelation(cor(x), .9)]

> set.seed(1)

> lmProfile <- rfe(x, logBBB,

+ sizes = varSizes,

+ rfeControl = rfeControl(functions = lmFuncs,

+ number = 200,

+ verbose = FALSE))

> rfProfile <- rfe(x, logBBB,

+ sizes = varSizes,

+ rfeControl = rfeControl(functions = rfFuncs,

+ number = 200,

+ verbose = FALSE))

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 31 / 44

Page 32: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Backwards Selection Results

Variables

RM

SE

0.6

0.7

0.8

0.9

1.0

1.1

0 20 40 60 80

●●

●●●●●●●●●●●●●●●●●●●●●

●●

●● ●

Linear RegRandom Forests

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 32 / 44

Page 33: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Opportunities for Parallel Processing

Recall the algorithm for selecting models via resampling:

Define sets of model parameter values to evaluate;for each parameter set do

for each resampling iteration doHold–out specific samples ;Fit the model on the remainder;Predict the hold–out samples;

endCalculate the average performance across hold–out predictions

endDetermine the optimal parameter set;

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 33 / 44

Page 34: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Opportunities for Parallel Processing

In this process, M models are fit to B resampled data sets.

There is (usually∗) no connection between these models, so they could berun within different processes on the same computer or over separatecomputers.

Can we get any benefit from parallel processing?

∗There are some exceptions where sub–models are evaluated withoutfurther re–fitting. For example, if we can fit a PLS model with 10components, we can get the results from models with 1–9 components forfree. We’ll call this the “sub–model” trick.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 34 / 44

Page 35: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

An Example – Boosted Trees

We trained a medium sized data set (n = 4, 500) to tune a gradientboosting machine (GBM) model sequentially and in parallel.

We fit models with four different values of the interaction depth and 10different values for the number of boosting iterations.

It turns out that, for each value of the interaction depth, we can fit onemodel with the largest number of iterations and get the predictions fromsmaller models at no cost.

This means we need to fit four models (with different interaction depths)for 50 bootstrap samples. We’ll partition these 200 model fits ontodifferent processes in a few ways to see if parallelization helps.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 35 / 44

Page 36: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Execution Times – An Example – Support Vector Machines

SVM regression models with 5 candidate values of the cost parameter with50 bootstrap iterations can be tested on the same data.

caret uses sub–models wherever possible to be efficient but, unlikeboosted trees, support vector machines cannot be exploited in this way.

The GBM and SVM computations were performed using sequentialprocess and parallel processing with 1 to 16 “worker nodes’.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 36 / 44

Page 37: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Execution Time Results

#Processors

Trai

ning

Tim

e

2040

6080

5 10 15

●●●

●●●

●●●

●●●

●●●

●●●

●●●●

●●●

●●●

●●●

●●●

●●

●●● ●●●● ●●● ●●●

GBM

5 10 15

1015

2025

3035

●●

●●●

●●●

●●●

●●●

●●●

●●●

●●●●●

●●●

●●● ●●●●●● ●●● ●●● ●●●

SVM

parallel sequential●

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 37 / 44

Page 38: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Speedups

speedup = Sequential Time / Parallel Time

#Processors

Spe

edup

1

2

3

4

5

5 10 15

●●●●●●

●●●

●●●

●●●

●●●

●●●

●●●●

●●●

●●●

●●●

●●

●●

●●●

●●●●●●●

●●●

GBM SVM●

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 38 / 44

Page 39: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Results

There is a benefit to adding more workers for these calculations.

The optimal speedup with be W where W is the number of workers. Weare not optimal, but we can cut the execution time down by 4–5 fold.

The SVM model benefited more than the GBM model, perhaps since GBMwas fitting less models (using sub–models).

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 39 / 44

Page 40: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Other Functions and Classes

nearZeroVar: a function to remove predictors that are sparse andhighly unbalanced

findCorrelation: a function to remove the optimal set ofpredictors to achieve low pair–wise correlations

predictors: class for determining which predictors are included inthe prediction equations (e.g. rpart, earth, lars models) (currently57 methods)

confusionMatrix, sensitivity, specificity, posPredValue,negPredValue: classes for assessing classifier performance

varImp: classes for assessing the aggregate effect of a predictor onthe model equations (currently 20 methods)

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 40 / 44

Page 41: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Other Functions and Classes

knnreg: nearest–neighbor regression

plsda, splsda: PLS discriminant analysis

icr: independent component regression

pcaNNet: nnet:::nnet with automatic PCA pre–processing step

bagEarth, bagFDA: bagging with MARS and FDA models

normalize2Reference: RMA–like processing of Affy arrays using atraining set

spatialSign: class for transforming numeric data (x ′ = x/||x ||)maxDissim: a function for maximum dissimilarity sampling

featurePlot: a wrapper for several lattice functions

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 41 / 44

Page 42: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Shameless Plug # 2: Other Packages

A few others that I’m working on...

sparseLDA: Lasso–type regularization for LDA

Cubist: Quinlan’s model trees

C5.0: Quinlan’s decision trees (I could use some C help here)

FuseBox: a framework for combining ensembles of models

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 42 / 44

Page 43: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Thanks

Kirk Mettler, Bruno and the NYC Predictive Analytics Organizers

R Core

Pfizer’s Statistics leadership for providing the time and support to create Rpackages

caret contributors: Jed Wing, Steve Weston, Andre Williams, ChrisKeefer and Allan Engelhardt

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 43 / 44

Page 44: The caret Package: A Unified Interface for Predictive Models · The caret Package: A Uni ed Interface for Predictive Models Max Kuhn P zer Global R&D Nonclinical Statistics Groton,

Session Info

R version 2.11.1 (2010-05-31), x86_64-apple-darwin9.8.0

Base packages: base, datasets, graphics, grDevices, methods, splines, stats,tools, utils

Other packages: caret 4.87, class 7.3-2, cluster 1.12.3, codetools 0.2-2,digest 0.4.2, e1071 1.5-24, gbm 1.6-3.1, kernlab 0.9-12, lattice 0.18-8,plyr 1.2.1, reshape 0.8.3, survival 2.35-8, weaver 1.16.0

Loaded via a namespace (and not attached): grid 2.11.1

This presentation was created with a MacPro using LATEXand R’s Sweave

function at 16:02 on Wednesday, May 11, 2011.

Max Kuhn (Pfizer Global R&D) caret May 12, 2011 44 / 44