Revolution Confidential. Introduction to R for Data Mining. 2012 Spring Webinar Series. Joseph B. Rickert, Revolution Analytics. June 5, 2012


Page 1: Intro to Data Mining Webinar

Introduction to R for Data Mining

2012 Spring Webinar Series

Joseph B. Rickert, Revolution Analytics

June 5, 2012

Page 2: Intro to Data Mining Webinar

Goals for Today's Webinar

To convince you that:

R is a serious platform for data mining

Seriously, it is not difficult to learn enough R to do some serious data mining

Revolution R Enterprise is the platform for serious data mining

Page 3: Intro to Data Mining Webinar

Data Mining

Applications

Credit Scoring

Fraud Detection

Ad Optimization

Targeted Marketing

Gene Detection

Recommendation systems

Social Networks

Actions

Acquire Data

Prepare

Classify

Predict

Visualize

Optimize

Interpret

Algorithms

CART

Random Forests

SVM

KMeans

Hierarchical clustering

Ensemble Techniques

Page 4: Intro to Data Mining Webinar

Recent KDnuggets Poll suggests so are a lot of other serious data miners

What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]

Software (voters)              % users in 2012   % users in 2011
R (245)                        30.7%             23.3%
Excel (238)                    29.8%             21.8%
Rapid-I RapidMiner (213)       26.7%             27.7%
KNIME (174)                    21.8%             12.1%
Weka / Pentaho (118)           14.8%             11.8%
StatSoft Statistica (112)      14.0%             8.5%
SAS (101)                      12.7%             13.6%
Rapid-I RapidAnalytics (83)    10.4%             Not asked in 2011
MATLAB (80)                    10.0%             7.2%
IBM SPSS Statistics (62)       7.8%              7.2%
IBM SPSS Modeler (54)          6.8%              8.3%
SAS Enterprise Miner (46)      5.8%              7.1%

Page 5: Intro to Data Mining Webinar

WHAT DOES IT MEAN TO LEARN R?

Learning R

Page 6: Intro to Data Mining Webinar

What does it mean to learn French?

To read a Menu

To get around Paris on the Metro

To carry on a conversation

Page 7: Intro to Data Mining Webinar

Learning R

Levels of R Skill

R developer

R contributor

R programmer

R user

R aware

Hours of use: 10 to 10,000 (The Malcolm Gladwell “Outlier” Scale)

Use a GUI

Use R Functions

Write functions

Write an R package

Write production level code

Page 8: Intro to Data Mining Webinar

THE STRUCTURE OF R FACILITATES LEARNING

Productive from the get-go!

Page 9: Intro to Data Mining Webinar

R is set up to compute functions on data

lm <- function(x,y){. . . }

lm.model
lm.model$assign
lm.model$coefficients
lm.model$df.residual
lm.model$effects
lm.model$fitted.values
.
.
.
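The idea on this slide can be tried directly at the R prompt. The sketch below is only an illustration: it assumes the built-in `cars` data set (not part of the webinar material) and uses the formula interface that `lm` actually exposes, then inspects the same components of the fitted object that the slide lists.

```r
# Fit a linear model on the built-in 'cars' data (an assumption for illustration;
# the webinar's own data is not used here)
lm.model <- lm(dist ~ speed, data = cars)

# The result is an ordinary R object; its components are reached with $
names(lm.model)
lm.model$coefficients
lm.model$df.residual
head(lm.model$fitted.values)

# Other functions take the fitted object as their data
summary(lm.model)
```

The same pattern (call a function on data, store the result, pass the result to other functions) carries over to most modelling functions in R.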

Page 10: Intro to Data Mining Webinar

A little knowledge goes a long way in R

R’s functional design facilitates performing small tasks

For the most part, the output of a function depends only on the values of its arguments

Calling a function multiple times with the same values of its arguments will produce the same result each time

Minimal side effects make it much easier to understand and predict the behavior of a program

The trick is knowing which functions to call
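A small sketch of what "minimal side effects" looks like in practice. The helper `rescale` is hypothetical, written only for this illustration; it shows that repeated calls with the same arguments agree and that the caller's data is left untouched.

```r
# Hypothetical helper: the result depends only on the arguments passed in
rescale <- function(x, lower = 0, upper = 1) {
  x <- (x - min(x)) / (max(x) - min(x))   # modifies the local copy only
  x * (upper - lower) + lower
}

v <- c(2, 4, 6, 10)

first  <- rescale(v)
second <- rescale(v)

identical(first, second)   # same arguments, same result
v                          # the original vector is unchanged
```

Because `rescale` works on its own copy of `x`, nothing outside the function changes, which is what makes small functions like this easy to combine.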

Page 11: Intro to Data Mining Webinar

Basic Machine Learning Functions

              Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
              kmeans         stats          Kmeans clustering
Classifiers   glm            stats          Logistic Regression
              rpart          rpart          Recursive partitioning and regression trees
              ksvm           kernlab        Support Vector Machine
Ensemble      ada            ada            Stochastic boosting
              randomForest   randomForest   Random Forests classification and regression
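To give a feel for how little code these functions need, here is a minimal sketch that calls four of them. It assumes the built-in `iris` data rather than the webinar's data, and it sticks to `stats` and `rpart`, which come with a standard R installation; `ksvm`, `ada` and `randomForest` follow a similar call pattern but require their packages to be installed first.

```r
library(rpart)   # recommended package shipped with standard R distributions

set.seed(42)                      # kmeans starts from random centers
x <- scale(iris[, 1:4])           # numeric measurements, standardized

# Clustering
hc        <- hclust(dist(x))      # hierarchical clustering on Euclidean distances
hc.groups <- cutree(hc, k = 3)    # cut the tree into 3 groups
km        <- kmeans(x, centers = 3, nstart = 10)

# Logistic regression with glm: is the flower a virginica?
iris2 <- transform(iris, is.virginica = as.integer(Species == "virginica"))
logit <- glm(is.virginica ~ Sepal.Length + Sepal.Width,
             data = iris2, family = binomial)

# Classification tree
tree      <- rpart(Species ~ ., data = iris, method = "class")
tree.pred <- predict(tree, iris, type = "class")
table(predicted = tree.pred, actual = iris$Species)
```

The confusion table here is computed on the training data, so it overstates accuracy; a held-out test set would be needed for an honest estimate.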

Page 12: Intro to Data Mining Webinar

Noteworthy Data Mining Packages

Package   Comment
rattle    A very intuitive GUI for data mining that produces useful R code
caret     Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems

Page 13: Intro to Data Mining Webinar

TIME TO RUN SOME CODE

Doing a lot with a little R

Page 14: Intro to Data Mining Webinar

Scripts to run

Script                              Some key functions
0 Setup                             Load libraries
1 Explore weather data              read.csv, plot
2 Run clustering algorithms         kmeans, hclust
3 Basic decision tree               rpart
4 Boosted Tree                      ada
5 Random Forest                     randomForest
6 Support Vector Machine            randomForest, varImpPlot
7 Big Data Mortgage Default model   rxLogit, rxKmeans
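The webinar scripts themselves are not reproduced in these slides. As a rough idea of what an exploration step with `read.csv` and `plot` can look like, the sketch below writes a tiny made-up weather file to a temporary location and reads it back; the file contents and the column names (`Date`, `MinTemp`, `MaxTemp`, `Rainfall`) are assumptions for illustration, not the webinar's data.

```r
# Hypothetical stand-in for a weather file: a few made-up rows in a temp CSV
csv.path <- tempfile(fileext = ".csv")
writeLines(c("Date,MinTemp,MaxTemp,Rainfall",
             "2012-06-01,8.0,24.3,0.0",
             "2012-06-02,14.0,26.9,3.6",
             "2012-06-03,13.7,23.4,3.6",
             "2012-06-04,13.3,15.5,39.8",
             "2012-06-05,7.6,16.1,2.8"),
           csv.path)

# Read the file into a data frame and convert the date column
weather <- read.csv(csv.path, stringsAsFactors = FALSE)
weather$Date <- as.Date(weather$Date)

str(weather)       # column types at a glance
summary(weather)   # basic descriptive statistics

# A first look at one variable over time
plot(weather$Date, weather$MaxTemp, type = "b",
     xlab = "Date", ylab = "MaxTemp")
```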

Page 15: Intro to Data Mining Webinar

Big Data and R

There are some challenges:

All of your data and model code must fit into memory

Big data sets as well as big models (lots of variables) can run out of memory

Parallel computation might be necessary for models to run in a reasonable time
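A quick back-of-the-envelope check can show whether a data set is likely to fit in memory. The sketch below measures the size of a numeric vector with `object.size` and then scales that up to a hypothetical table; the table dimensions (100 million rows, 50 numeric columns) are made up for illustration.

```r
# Measure how much memory one million doubles take
n.rows <- 1e6
x      <- numeric(n.rows)
bytes  <- as.numeric(object.size(x))
bytes / n.rows            # close to 8 bytes per value

# Rough estimate for a hypothetical table: 100 million rows x 50 numeric columns
est.gb <- 1e8 * 50 * 8 / 1024^3
est.gb                    # tens of gigabytes, before any copies made while modelling
```

Modelling functions often create additional copies of the data, so the real requirement is usually a multiple of this estimate.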

Page 16: Intro to Data Mining Webinar

RevoScaleR in Revolution R Enterprise

Can help in a number of ways:

Manipulate large data sets, perhaps aggregating data so that it will fit in memory

For example, boiling down time-stamped data like a web log to form a time series that will fit in memory

Run RevoScaleR functions directly on big data sets

Run R functions in parallel
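The web-log example can be illustrated without RevoScaleR. The sketch below is a plain base-R stand-in, not the RevoScaleR approach: it reads a made-up log file in chunks and keeps only hourly counts in memory, which is the "boil down to a time series" idea. The file layout, the chunk size and all variable names are assumptions for illustration.

```r
# Hypothetical web log: one time stamp per request, written to a temp file
log.path <- tempfile(fileext = ".csv")
set.seed(1)
stamps <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC") +
  sort(sample(0:86399, 5000, replace = TRUE))
writeLines(c("timestamp", format(stamps, "%Y-%m-%d %H:%M:%S")), log.path)

# Read the file in chunks; only the 24 hourly counts stay in memory
con <- file(log.path, open = "r")
invisible(readLines(con, n = 1))          # skip the header line
hourly <- integer(24)
repeat {
  chunk <- readLines(con, n = 1000)       # next 1000 log lines
  if (length(chunk) == 0) break           # end of file
  hours  <- as.integer(substr(chunk, 12, 13))   # "HH" part of the time stamp
  hourly <- hourly + tabulate(hours + 1, nbins = 24)
}
close(con)

# The aggregated result is a small time series of requests per hour
hits <- ts(hourly, start = 0, frequency = 1)
hits
```

However large the log file is, the object that reaches the modelling step has only 24 values per day.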

Page 17: Intro to Data Mining Webinar

Top RevoScaleR Functions for Data Mining

parallel external memory algorithms

Task                        RevoScaleR function
Data processing             rxDataStep
Descriptive Statistics      rxSummary
Tables and cubes            rxCube, rxCrossTabs
Correlations / covariance   rxCovCor, rxCor, rxCov, rxSSCP
Linear Models               rxLinMod
Logistic regressions        rxLogit
Generalized linear models   rxGlm
K means clustering          rxKmeans
Predictions (scoring)       rxPredict

Page 18: Intro to Data Mining Webinar

WHERE TO GO FROM HERE?

More than code, R is a community

Page 19: Intro to Data Mining Webinar

Finding your way around the R world

Machine Learning, Data Mining, Visualization

Finding Packages: Task Views, crantastic.org

Blogs: Revolutions, R-Bloggers, Quick-R

Getting Help: StackOverflow, @RLangTip, Inside-R, www.rseek.org

Finding R People: User Groups worldwide, #rstats

[Figure: Word Cloud for @inside_R]

Page 20: Intro to Data Mining Webinar

Look at some more sophisticated examples

Thomson Nguyen on the Heritage Health Prize

Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System

Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment

Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)

Page 21: Intro to Data Mining Webinar

Revolution Analytics Training

http://www.revolutionanalytics.com/products/training/

