10 R Packages to Win Kaggle Competitions Xavier Conort Data Scientist

Page 1: 10 R Packages to Win Kaggle Competitions

10 R Packages to Win Kaggle Competitions

Xavier Conort, Data Scientist

Page 2: 10 R Packages to Win Kaggle Competitions

Previously... … now!

Page 3: 10 R Packages to Win Kaggle Competitions

Competitions that boosted my R learning curve

The Machine seems much smarter than I am at capturing complexity in the data even for simple datasets!

Humans can help the Machine too! But don’t oversimplify and discard any data.

Don’t be impatient. My best GBM had 24,500 trees with learning rate = 0.01!

SVM and feature selection matter too!

Page 4: 10 R Packages to Win Kaggle Competitions

Competitions that boosted my R learning curve

Word n-grams and character n-grams can make a big difference

Parallel processing and big servers can help with complex feature engineering!

Still many awesome tools in R that I don’t know!

Glmnet can do a great job!

Page 5: 10 R Packages to Win Kaggle Competitions

10 R Packages:

Allow the Machine to Capture Complexity
1. gbm
2. randomForest
3. e1071

Take Advantage of High-Cardinality Categorical or Text Data
4. glmnet
5. tau

Make Your Code More Efficient
6. Matrix
7. SOAR
8. foreach
9. doMC
10. data.table

Page 6: 10 R Packages to Win Kaggle Competitions

Capture Complexity Automatically

Page 7: 10 R Packages to Win Kaggle Competitions

1. gbm
Gradient Boosting Machine (Freund & Schapire)
Greg Ridgeway / Harry Southworth

Key Trick: Use gbm.more to write your own early-stopping procedure
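A minimal sketch of such an early-stopping loop, on simulated data. The data set, the batch size of 50 trees, the patience of 3 rounds and the helper `valid_rmse` are illustrative assumptions, not the settings used in the competitions above.

```r
library(gbm)

set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
d  <- data.frame(x1 = x1, x2 = x2,
                 y = sin(4 * x1) + x2^2 + rnorm(n, sd = 0.1))
train <- d[1:400, ]
valid <- d[401:500, ]

# Start with a small model; keep.data = TRUE lets gbm.more reuse the training data.
fit <- gbm(y ~ x1 + x2, data = train, distribution = "gaussian",
           n.trees = 50, shrinkage = 0.05, interaction.depth = 2,
           bag.fraction = 0.5, keep.data = TRUE, verbose = FALSE)

# Hypothetical helper: RMSE on the hold-out set using all trees grown so far.
valid_rmse <- function(model) {
  p <- predict(model, newdata = valid, n.trees = model$n.trees)
  sqrt(mean((valid$y - p)^2))
}

best_rmse  <- valid_rmse(fit)
best_trees <- fit$n.trees
patience   <- 3   # stop after this many batches without improvement
stalls     <- 0

while (stalls < patience && fit$n.trees < 2000) {
  fit <- gbm.more(fit, n.new.trees = 50, verbose = FALSE)  # grow 50 more trees
  err <- valid_rmse(fit)
  if (err < best_rmse - 1e-4) {
    best_rmse  <- err
    best_trees <- fit$n.trees
    stalls     <- 0
  } else {
    stalls <- stalls + 1
  }
}

# Predict with the tree count that scored best on the hold-out set.
pred <- predict(fit, newdata = valid, n.trees = best_trees)
```

Growing the model in batches means the hold-out error is checked as training proceeds, so a slow learning rate does not require guessing the number of trees in advance.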

Page 8: 10 R Packages to Win Kaggle Competitions

2. randomForest
Random Forests (Breiman & Cutler)
Authors: Breiman and Cutler
Maintainer: Andy Liaw

Key Tricks:
Use importance = TRUE for permutation importance
Tune the sampsize parameter for faster computation and for handling unbalanced classes
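Both tricks can be sketched on a simulated unbalanced problem. The data and the choice of drawing the same number of cases from each class are assumptions made for illustration.

```r
library(randomForest)

set.seed(1)
n <- 1000
x <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
# Roughly one case in nine is "pos", so the classes are unbalanced.
y <- factor(ifelse(x$a + 0.5 * x$b + rnorm(n) > 1.8, "pos", "neg"))

n_min <- min(table(y))  # size of the rarer class

rf <- randomForest(x, y,
                   ntree      = 200,
                   importance = TRUE,           # compute permutation importance
                   strata     = y,              # sample within each class
                   sampsize   = rep(n_min, 2))  # same draw per class for each tree

# type = 1 requests the permutation-based (mean decrease in accuracy) measure.
imp <- importance(rf, type = 1)
print(imp)
```

A smaller sampsize makes each tree cheaper to grow, and drawing equal counts per class keeps the majority class from dominating every bootstrap sample.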

Page 9: 10 R Packages to Win Kaggle Competitions

3. e1071
Support Vector Machines
Maintainer: David Meyer

Key Tricks:
Use kernlab (Karatzoglou, Smola and Hornik) to get a heuristic
Write your own pattern search
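One possible reading of these two tricks, sketched on the iris data. It assumes the heuristic meant is kernlab's `sigest`, which estimates a range for the RBF kernel width, and it uses a simple coordinate pattern search over log2(cost) and log2(gamma). The function `pattern_search`, the step sizes and the 5-fold setting are hypothetical choices, not the author's procedure.

```r
library(e1071)

x <- as.matrix(iris[, 1:4])
y <- iris$Species

# sigest returns three values (0.1 quantile, median, 0.9 quantile) for sigma.
sig   <- kernlab::sigest(x, scaled = TRUE)
start <- c(log2(1), log2(unname(sig[2])))  # (log2 cost, log2 gamma)

# Score a candidate by the cross-validated accuracy reported by e1071::svm.
cv_acc <- function(p) {
  svm(x, y, kernel = "radial",
      cost = 2^p[1], gamma = 2^p[2], cross = 5)$tot.accuracy
}

# Try a step up and down along each axis; halve the step when nothing improves.
pattern_search <- function(f, p, step = 2, min_step = 0.25) {
  best <- f(p)
  while (step >= min_step) {
    improved <- FALSE
    for (i in seq_along(p)) {
      for (s in c(-step, step)) {
        q <- p
        q[i] <- q[i] + s
        val <- f(q)
        if (val > best) {
          best <- val
          p <- q
          improved <- TRUE
        }
      }
    }
    if (!improved) step <- step / 2
  }
  list(par = p, value = best)
}

res <- pattern_search(cv_acc, start)
cat("cost =", 2^res$par[1], " gamma =", 2^res$par[2],
    " CV accuracy =", res$value, "\n")
```

Starting from the heuristic value usually means the search needs far fewer model fits than a full grid over cost and gamma.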

Page 10: 10 R Packages to Win Kaggle Competitions

Take Advantage of High-Cardinality Categorical or Text Features

Page 11: 10 R Packages to Win Kaggle Competitions

4. glmnet
Authors: Friedman, Hastie, Simon, Tibshirani
L1 / Elastic net / L2

Key Tricks:
- Try interactions of 2 or more categorical variables
- Test your code on the Kaggle “Amazon Employee Access Challenge”
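The interaction trick can be sketched as follows. The two categorical columns `role` and `dept` and the simulated outcome are hypothetical stand-ins, not the Amazon data, and alpha = 0 (ridge) is just one point on the L1 to L2 range.

```r
library(glmnet)
library(Matrix)

set.seed(1)
n <- 2000
d <- data.frame(role = factor(sample(letters[1:8], n, replace = TRUE)),
                dept = factor(sample(LETTERS[1:6], n, replace = TRUE)))

# Simulated outcome that depends on the specific (role, dept) pair.
pair <- interaction(d$role, d$dept, drop = TRUE)
eff  <- rnorm(nlevels(pair))
y    <- rbinom(n, 1, plogis(eff[as.integer(pair)]))

# Main effects plus the two-way interaction, stored as a sparse matrix.
# The intercept column is dropped because glmnet fits its own intercept.
X <- sparse.model.matrix(~ role + dept + role:dept, data = d)[, -1]

# alpha = 0 is ridge (L2), alpha = 1 is lasso (L1), values in between are elastic net.
cv <- cv.glmnet(X, y, family = "binomial", alpha = 0, nfolds = 5)

pred <- predict(cv, newx = X, s = "lambda.min", type = "response")
```

The penalty is what makes the wide interaction matrix usable: most pair-level coefficients are shrunk heavily, and only well-supported pairs move the prediction.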

Page 12: 10 R Packages to Win Kaggle Competitions

5. tau
Maintainer: Kurt Hornik
Used for automating text mining

Key Trick: Try character n-grams. They work surprisingly well!
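A small sketch of character n-gram counting with tau's `textcnt`. The two example sentences and the choice of n = 3L are assumptions. The second sentence contains a typo on purpose, to show that character n-grams still overlap where whole words would not match.

```r
library(tau)

docs <- c("great product, works great",
          "terrible prodct, broke fast")

# method = "ngram" counts character n-grams rather than word sequences.
grams <- lapply(docs, function(s) textcnt(s, n = 3L, method = "ngram"))

# Build a small document-by-n-gram count matrix over the shared vocabulary.
vocab <- sort(unique(unlist(lapply(grams, names))))
dtm <- t(sapply(grams, function(g) {
  v <- setNames(numeric(length(vocab)), vocab)
  v[names(g)] <- as.numeric(g)
  v
}))

# n-grams that occur in both documents, e.g. fragments of "product" / "prodct".
shared <- vocab[colSums(dtm > 0) == 2]
print(shared)
```

For real corpora the dense matrix here would normally be replaced by a sparse one before modelling.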

Page 13: 10 R Packages to Win Kaggle Competitions

Make Your Code More Efficient

Page 14: 10 R Packages to Win Kaggle Competitions

6. Matrix
Authors / Maintainers: Douglas Bates and Martin Maechler

Key Trick: Use sparse.model.matrix for one-hot encoding
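A minimal sketch with one simulated high-cardinality factor. The column name `id` and the sizes are assumptions.

```r
library(Matrix)

set.seed(1)
d <- data.frame(id = factor(sample(sprintf("user%04d", 1:500), 5000,
                                   replace = TRUE)))

# "- 1" removes the intercept, so the factor gets one indicator column per level.
# (With several factors in the formula, later factors drop a reference level.)
X <- sparse.model.matrix(~ id - 1, data = d)

dim(X)                               # rows x number of levels
print(object.size(X), units = "Kb")  # only the non-zero entries are stored
```

Each row holds a single 1, so the sparse representation stores about one value per row instead of one per row and level.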

Page 15: 10 R Packages to Win Kaggle Competitions

7. SOAR
Author / Maintainer: Bill Venables
Used to store large R objects in the cache and release memory

Key Trick: Once I found out about it, it made my R experience great! (Just remember to empty your cache …)

Page 16: 10 R Packages to Win Kaggle Competitions

8. foreach and 9. doMC
Authors: Revolution Analytics

Key Trick: Use them for parallel processing to speed up computation
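A minimal sketch of the pattern, assuming a Unix-like machine (doMC relies on forking) and 2 cores. The per-iteration task is a placeholder for a real feature-engineering or model-fitting step.

```r
library(foreach)
library(doMC)

registerDoMC(cores = 2)  # register the multicore backend for %dopar%

# Each iteration runs in a separate worker; .combine = c collects a numeric vector.
res <- foreach(i = 1:4, .combine = c) %dopar% {
  mean(rnorm(1e5, mean = i))  # placeholder for an expensive computation
}

print(res)
```

Replacing `%dopar%` with `%do%` runs the same loop sequentially, which is handy for debugging.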

Page 17: 10 R Packages to Win Kaggle Competitions

10. data.table
Authors: M Dowle, T Short and others
Maintainer: Matt Dowle

Key Trick: Essential for doing fast data aggregation operations at scale
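A small sketch of grouped aggregation in data.table. The table and its column names are made up for illustration.

```r
library(data.table)

dt <- data.table(customer = c("a", "a", "b", "b", "b"),
                 amount   = c(10, 20, 5, 5, 20))

# One row per customer: count, total and mean of amount.
agg <- dt[, .(n = .N, total = sum(amount), avg = mean(amount)), by = customer]

# Add a per-group feature to every row by reference, without copying the table.
dt[, share := amount / sum(amount), by = customer]

print(agg)
print(dt)
```

The second form is the one that tends to matter for feature engineering: group statistics become new columns on the original rows.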

Page 18: 10 R Packages to Win Kaggle Competitions

Don’t Forget... Use your intuition to help the machine!

● Always compute differences / ratios of features
  o This can help the Machine a lot!

● Always consider discarding features that are “too good”
  o They can make the Machine lazy!
  o An example: GE Flight Quest
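The first bullet, differences and ratios of features, can be sketched in base R. The flight-style column names and values are hypothetical and only illustrate the idea.

```r
flights <- data.frame(scheduled_min = c(120, 95, 240),
                      estimated_min = c(130, 90, 300))

# Tree-based models split on one feature at a time, so an explicit difference
# or ratio hands them a relationship they would otherwise have to approximate.
flights$delay_diff  <- flights$estimated_min - flights$scheduled_min
flights$delay_ratio <- flights$estimated_min / flights$scheduled_min

print(flights)
```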

Page 19: 10 R Packages to Win Kaggle Competitions

Thank you!