
Why do Data Science with R? · 2014. 4. 26. · Data Mining, Rattle, and R Definition Data Mining in Research Health Research Adverse reactions using linked Pharmaceutical, General


A Data Mining Workshop

Excavating Knowledge from Data
Introducing Data Mining using R

[email protected]

Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia

http://datamining.togaware.com

Visit http://onepager.togaware.com for Workshop Notes

http://togaware.com Copyright © 2014, [email protected]

Workshop Overview

1 R: A Language for Data Mining

2 Data Mining, Rattle, and R

3 Loading, Cleaning, Exploring Data in Rattle

4 Descriptive Data Mining

5 Predictive Data Mining: Decision Trees

6 Predictive Data Mining: Ensembles

7 Moving into R and Scripting our Analyses

8 Literate Data Mining in R

R: A Language for Data Mining

Installing R and Rattle

The first task is to install R.

As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.

Visit CRAN at http://cran.rstudio.com

Visit Rattle at http://rattle.togaware.com

Linux: Install packages (Ubuntu is recommended)

$ wajig install r-recommended r-cran-rattle

Windows: Download and install from CRAN

MacOSX: Download and install from CRAN

Why do Data Science with R?

Most widely used Data Mining and Machine Learning Package

Machine Learning
Statistics
Software Engineering and Programming with Data
But not the nicest of languages for a Computer Scientist!

Free (Libre) Open Source Statistical Software

... all modern statistical approaches

... many/most machine learning algorithms

... opportunity to readily add new algorithms

That is important for us in the research community
Get our algorithms out there and being used: impact!!!


How Popular is R? Discussion List Traffic

Monthly email traffic on software’s main discussion list.

Source: http://r4stats.com/articles/popularity/

How Popular is R? Discussion Topics

Number of discussions on popular Q&A forums in 2013.

Source: http://r4stats.com/articles/popularity/


How Popular is R? R versus SAS

Number of R/SAS related posts to Stack Overflow by week.

Source: http://r4stats.com/articles/popularity/


How Popular is R? Professional Forums

Registered members of the main discussion group for each software package.

Source: http://r4stats.com/articles/popularity/


How Popular is R? Used in Analytics Competitions

Software used in data analysis competitions in 2011.

Source: http://r4stats.com/articles/popularity/


How Popular is R? User Survey

Rexer Analytics Survey 2010 results for data mining/analytic tools.

Source: http://r4stats.com/articles/popularity/


What is R?

R — The Video

A 90 Second Promo from Revolution Analytics

http://www.revolutionanalytics.com/what-is-open-source-r/

Data Mining, Rattle, and R

Data Mining

A data-driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.


Data Mining

Application of

Machine Learning
Statistics
Software Engineering and Programming with Data
Effective Communications and Intuition

... to Datasets that vary by Volume, Velocity, Variety, Value, Veracity

... to discover new knowledge

... to improve business outcomes

... to deliver better tailored services


Data Mining in Research

Health Research: Adverse reactions using linked Pharmaceutical, General Practitioner, Hospital, Pathology datasets.

Astronomy: Microlensing events in the Large Magellanic Cloud of several million observed stars (out of 10 billion).

Psychology: Investigation of age-of-onset for Alzheimer’s disease from 75 variables for 800 people.

Social Sciences: Survey evaluation. Social network analysis: identifying key influencers.


Data Mining in Government

Australian Taxation Office: Lodgment ($110M), Tax Havens ($150M), Tax Fraud ($250M)

Immigration and Border Control: Check passengers before boarding

Health and Human Services: Doctor shoppers, Over servicing


The Business of Data Mining

SAS has annual revenues of $3B (2013)

IBM bought SPSS for $1.2B (2009)

Analytics is a >$100B business, expected to be >$320B by 2020

Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...

Shortage of 180,000 data scientists in the US in 2018 (McKinsey) ...


Basic Tools: Data Mining Algorithms

Cluster Analysis (kmeans, wskm)

Association Analysis (arules)

Linear Discriminant Analysis (lda)

Logistic Regression (glm)

Decision Trees (rpart, wsrpart)

Random Forests (randomForest, wsrf)

Boosted Stumps (ada)

Neural Networks (nnet)

Support Vector Machines (kernlab)

. . .

That’s a lot of tools to learn in R!
Many with different interfaces and options.
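To illustrate the point about differing interfaces, here is a minimal sketch that uses only packages shipped with a standard R installation (stats and the recommended rpart package). The built-in iris and mtcars datasets are stand-ins chosen for illustration; they are not part of the workshop material.

```r
library(rpart)  # Recommended package, normally installed alongside R.

# Formula interface: target ~ inputs, with the data frame passed separately.
tree <- rpart(Species ~ ., data = iris)
tree_pred <- predict(tree, iris, type = "class")

# Matrix-style interface: numeric inputs only, no target variable at all.
set.seed(42)
clusters <- kmeans(iris[, 1:4], centers = 3)
cluster_id <- clusters$cluster

# Formula interface again, but the kind of model is picked by an argument.
logit <- glm(am ~ wt, data = mtcars, family = binomial)
logit_prob <- predict(logit, mtcars, type = "response")
```

Each function takes its inputs differently and returns a different kind of object, and even predict() needs a different type argument per model. Smoothing over this variety is part of what a GUI such as Rattle offers.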


Why a GUI?

Statistics can be complex and traps await

So many tools in R to deliver insights

Effective analyses should be scripted

Scripting also required for repeatability

R is a language for programming with data

How to remember how to do all of this in R?
How to skill up 150 data analysts with Data Mining?


Users of Rattle

Today, Rattle is used worldwide in many industries

Health analytics

Customer segmentation and marketing

Fraud detection

Government

It is used by

Universities to teach Data Mining

Within research projects for basic analyses

Consultants and Analytics Teams across business

It is and will remain freely available.

CRAN and http://rattle.togaware.com


Installation

Rattle is built using R

We need to download and install R from cran.r-project.org

We also recommend installing RStudio from www.rstudio.org

Then start up RStudio and install Rattle:

install.packages("rattle")

Then we can load and start up Rattle:

library(rattle) # Attach the package so that rattle() can be found.

rattle()

Required packages are loaded as needed.


A Tour Thru Rattle: Startup


A Tour Thru Rattle: Loading Data


A Tour Thru Rattle: Explore Distribution


A Tour Thru Rattle: Explore Correlations


A Tour Thru Rattle: Hierarchical Cluster


A Tour Thru Rattle: Decision Tree


A Tour Thru Rattle: Decision Tree Plot


A Tour Thru Rattle: Random Forest


A Tour Thru Rattle: Risk Chart

[Figure: risk chart for the Random Forest model on weather.csv [test], target RainTomorrow. x-axis: Caseload (%); y-axis: Performance (%); series: RainTomorrow (92%), Rain in MM (97%), Precision; with Risk Scores and Lift annotations.]

Data Scientists are Programmers of Data

But. . .

Data scientists are programmers of data

A GUI can only do so much

R is a powerful statistical language

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing


From GUI to CLI — Rattle’s Log Tab


Interface Notes

Work through the tabs from left to right

After setting up a tab we need to Execute it

Projects save the current Rattle state

Projects can be restored at a later time


Tools

Ubuntu GNU/Linux operating system

Feature rich toolkit, up-to-date, easy to install, FLOSS

RStudio

Easy to use integrated development environment, FLOSS
Powerful alternative is Emacs with ESS (Emacs Speaks Statistics), FLOSS

R Statistical Software Language

Extensive, powerful, thousands of contributors, FLOSS

knitr and LaTeX

Produce beautiful documents, easily reproducible, FLOSS


Using Ubuntu

Desktop Operating System (GNU/Linux)

Replacing Windows and OSX

The GNU Tool Suite based on Unix — significant heritage

Multiple specialised single task tools, working well together

Compared to single application trying to do it all

Powerful data processing from the command line: grep, awk, head, tail, wc, sed, perl, python, most, diff, make, paste, join, patch, ...

For interacting with R — start up RStudio from the Dash


RStudio—The Default Three Panels


RStudio—With R Script File—Editor Panel


Scatterplot—R Code

Our first little bit of R code:

Load a couple of packages from the R library into the session

library(rattle) # Provides the weather dataset

library(ggplot2) # Provides the qplot() function

Then produce a quick plot using qplot()

ds <- weather

qplot(MinTemp, MaxTemp, data=ds)


Scatterplot—Plot

[Figure: scatterplot of MaxTemp (y-axis) against MinTemp (x-axis).]

Scatterplot—RStudio


Missing Packages: Tools → Install Packages...

RStudio—Installing ggplot2


RStudio—Keyboard Shortcuts

These will become very useful!

Editor:

Ctrl-Enter will send the line of code to the R console
Ctrl-2 will move the cursor to the Console

Console:

UpArrow will cycle through previous commands
Ctrl-UpArrow will search previous commands
Tab will complete function names and list the arguments
Ctrl-1 will move the cursor to the Editor


Basic R

library(rattle) # Load the weather dataset.

head(weather) # First 6 observations of the dataset.

## Date Location MinTemp MaxTemp Rainfall Evapora...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...

....

str(weather) # Structure of the variables in the dataset.

## 'data.frame': 366 obs. of 24 variables:

## $ Date : Date, format: "2007-11-01" "2007-11-...

## $ Location : Factor w/ 46 levels "Adelaide","Alba...

## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...

....


Basic R

summary(weather) # Univariate summary of the variables.

## Date Location MinTemp ...

## Min. :2007-11-01 Canberra :366 Min. :-5.30 ...

## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.30 ...

## Median :2008-05-01 Albany : 0 Median : 7.45 ...

## Mean :2008-05-01 Albury : 0 Mean : 7.27 ...

## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.50 ...

## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.90 ...

## (Other) : 0 ...

## Rainfall Evaporation Sunshine WindGust...

## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW : ...

## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW : ...

## Median : 0.00 Median : 4.20 Median : 8.60 E : ...

## Mean : 1.43 Mean : 4.52 Mean : 7.91 WNW : ...

## 3rd Qu.: 0.20 3rd Qu.: 6.40 3rd Qu.:10.50 ENE : ...

....


Visual Summaries—Add A Little Colour

qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)

[Figure: scatterplot of Pressure3pm (y-axis) against Humidity3pm (x-axis), points coloured by RainTomorrow (No/Yes).]

Visual Summaries—Careful with Categorics

qplot(WindGustDir, Pressure3pm, data=ds)

[Figure: plot of Pressure3pm (y-axis) against WindGustDir (x-axis: N, NNE, ..., NNW, NA).]

Visual Summaries—Add A Little Jitter

qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")

[Figure: jittered plot of Pressure3pm (y-axis) against WindGustDir (x-axis: N, NNE, ..., NNW, NA).]

Visual Summaries—And Some Colour

qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")

[Figure: jittered plot of Pressure3pm (y-axis) against WindGustDir (x-axis: N, NNE, ..., NNW, NA), points coloured by WindGustDir.]

Getting Help—Precede Command with ?

Loading, Cleaning, Exploring Data in Rattle

Loading Data


Exploring Data


Test Data


Transform Data

Descriptive Data Mining

What is Cluster Analysis?

Cluster: a collection of observations

Similar to one another within the same cluster
Dissimilar to the observations in other clusters

Cluster analysis: grouping a set of data observations into classes

Clustering is unsupervised classification: no predefined classes (descriptive data mining).

Typical applications

As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms


What Is Good Clustering?

High Quality:

high intra-class similarity
low inter-class similarity

The Quality depends on:

similarity measure
algorithm for searching

Depends on the opinion of the user, and the algorithm’s ability to discover hidden patterns that are of interest to the user.
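One hedged way to put numbers on "high intra-class similarity, low inter-class similarity" is to compare the within-cluster and between-cluster sums of squares that kmeans() reports. The sketch below uses the built-in iris measurements purely as stand-in data; the choice of k = 3 and the ratio used as a score are assumptions for illustration, not the workshop's method.

```r
# Scale the numeric inputs so no single variable dominates the distances.
x <- scale(iris[, 1:4])

set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)

# Small within-cluster spread means observations sit close to their centre.
within <- km$tot.withinss

# Large between-cluster spread means the cluster centres are far apart.
between <- km$betweenss

# Share of the total variation explained by the clustering (0 to 1).
explained <- between / km$totss
print(round(c(within = within, between = between, explained = explained), 3))
```

Such a score only captures the similarity measure side of quality. Whether the clusters are interesting still depends on the user, as the slide notes.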


Major Clustering Approaches

Partitioning algorithms (kmeans, pam, clara, fanny): Construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) cluster.

Hierarchical algorithms (hclust, agnes, diana): Create a hierarchical decomposition of the set of observations using some criterion

Density-based algorithms: based on connectivity and density functions

Grid-based algorithms: based on a multiple-level granularity structure

Model-based algorithms (mclust for mixture of Gaussians): A model is hypothesized for each of the clusters and the idea is to find the best fit of that model
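As a small sketch of how two of these families look in practice, the following compares a hierarchical clustering (hclust) with a partitioning one (kmeans) using only base R. The iris data, the Ward linkage, and k = 3 are assumptions made for the example.

```r
x <- scale(iris[, 1:4])

# Hierarchical: build the full tree of merges, then cut it into 3 groups.
hc <- hclust(dist(x), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Partitioning: fix k up front and iterate from random starting centres.
set.seed(42)
km_groups <- kmeans(x, centers = 3, nstart = 25)$cluster

# Cross-tabulate the two assignments to see where they agree.
agreement <- table(hierarchical = hc_groups, partitioning = km_groups)
print(agreement)
```

The cluster labels themselves are arbitrary, so agreement shows up as large counts concentrated in one cell per row rather than on the diagonal.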


KMeans Clustering


Association Rule Mining

An unsupervised learning algorithm: descriptive data mining.

Identify items (patterns) that occur frequently together in a given set of data.

Patterns = associations, correlations, causal structures (Rules).

Data = sets of items in ...

transactional database
relational database
complex information repositories

Rule: Body → Head [support, confidence]
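To make the [support, confidence] annotation concrete, here is a minimal base R sketch. It uses the usual definitions: support is the share of transactions containing both Body and Head, and confidence is that share divided by the share containing Body. The baskets and the helper names (contains, rule_stats) are hypothetical, invented for this example.

```r
# Hypothetical toy transactions, one character vector per basket.
baskets <- list(
  c("nappies", "beer", "chips"),
  c("nappies", "beer"),
  c("nappies", "milk"),
  c("beer", "chips"),
  c("milk", "bread")
)

# TRUE for each basket that contains every one of the given items.
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

# Support and confidence for the rule body -> head.
rule_stats <- function(body, head) {
  both <- mean(contains(c(body, head)))
  c(support = both, confidence = both / mean(contains(body)))
}

nappies_beer <- rule_stats(body = "nappies", head = "beer")
print(nappies_beer)  # support 0.4, confidence about 0.667
```

Two of the five baskets hold both items (support 40%), and two of the three nappies baskets also hold beer (confidence about 67%). For real datasets the arules package listed earlier in the workshop is the practical choice.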


Typical Applications

Link analysis

Market basket analysis

Cross marketing

Customers who purchase . . .


Examples

Friday ∩ Nappies → Beer [0.5%, 60%]

Age ∈ [20, 30] ∩ Income ∈ [20K, 30K] → MP3Player [2%, 60%]

Maths ∩ CS → HDinCS [1%, 75%]

Gladiator ∩ Patriot → Sixth Sense [0.1%, 90%]

Statins ∩ Peritonitis → Chronic Renal Failure [0.1%, 32%]


Health Insurance Commission

Associations on episode database for pathology services

6.8 million records × 120 attributes (3.5GB)
15 months preprocessing then 2 weeks data mining

Goal: find associations between tests

Minimum confidence cmin = 50% and minimum support smin = 1%, 0.5%, 0.25% (1% of 6.8 million = 68,000)
Unexpected/unnecessary combination of services

Refusing cover saves $550,000 per year


Association Rules

Predictive Data Mining: Decision Trees

Predictive Modelling: Classification

Goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.

The model is to be used on unseen cases to make decisions.

Often referred to as supervised learning.

Common approaches: decision trees; neural networks; logistic regression; support vector machines.


Language: Decision Trees

Knowledge representation: A flow-chart-like tree structure

An internal node denotes a test on a variable

A branch represents an outcome of the test

Leaf nodes represent class labels or class distributions

[Figure: example decision tree testing Age (< 42 / > 42) and Gender (Male / Female), with leaves labelled Y and N.]
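The node, branch, and leaf vocabulary can be seen directly in a fitted model. The following is a minimal sketch using rpart, the decision tree package named earlier in the workshop, with the built-in iris data standing in for a real dataset (an assumption made so that the example is self-contained).

```r
library(rpart)

# Fit a classification tree: Species is the target, all others are inputs.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Each printed row is a node: the test that leads to it, the number of
# observations, the majority class, and the class distribution.
# Rows marked with * are leaf nodes.
print(fit)

# The same structure as a data frame: var is "<leaf>" for leaf nodes and
# the name of the tested variable for internal nodes.
is_leaf <- fit$frame$var == "<leaf>"

# Apply the tree to cases and compare with the known classes.
pred <- predict(fit, iris, type = "class")
accuracy <- mean(pred == iris$Species)
```

Accuracy measured on the training data, as here, is optimistic. The model is meant to be used on unseen cases, so a held-out test set gives a fairer estimate.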


Tree Construction: Divide and Conquer

Decision tree induction is an example of a recursive partitioning algorithm: divide and conquer.

At start, all the training examples are at the root

Partition examples recursively based on selected variables

[Figure: training examples (+ and −) at the root, partitioned by Age (<42 / >42) and then by Gender (Males / Females).]

Algorithm for Decision Tree Induction

A greedy algorithm: takes the best immediate (local) decision while building the overall model

Tree constructed top-down, recursive, divide-and-conquer

Begin with all training examples at the root

Data is partitioned recursively based on selected variables

Select variables on basis of a measure

Stop partitioning when?

All samples for a given node belong to the same class

There are no remaining variables for further partitioning – majority voting is employed for classifying the leaf

There are no samples left


Basic Motivation: Entropy

We are trying to predict output Y (e.g., Yes/No) from input X .

A random data set may have high entropy:

Y is from a uniform distribution

a frequency distribution would be flat!

a sample will include uniformly random values of Y

A data set with low entropy:

Y’s distribution will be very skewed

a frequency distribution will have a single peak

a sample will predominantly contain just Yes or just No

Work towards reducing the amount of entropy in the data!


Variable Selection Measure: Entropy

Information gain (ID3/C4.5)

Select the variable with the highest information gain

Assume there are two classes: P and N

Let the data S contain p elements of class P and n elements of class N

The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

$$I_E(p, n) = -\frac{p}{p + n}\log_2\frac{p}{p + n} - \frac{n}{p + n}\log_2\frac{n}{p + n}$$
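The entropy measure above can be checked with a few lines of R. This is a minimal sketch: the function name info() is ours (it is not part of rpart or Rattle), and it assumes the usual convention that a class with zero observations contributes nothing to the sum.

```r
# Information (entropy) of a two-class sample with p positives and n negatives.
# info() is a hypothetical helper written for illustration only.
info <- function(p, n) {
  probs <- c(p, n) / (p + n)
  probs <- probs[probs > 0]   # convention: 0 * log2(0) contributes 0
  -sum(probs * log2(probs))
}

info(5, 5)    # evenly mixed sample: 1 bit
info(10, 0)   # pure sample: 0 bits
info(9, 5)    # mostly one class: roughly 0.94 bits
```

An evenly mixed sample needs the most information to classify; a pure sample needs none.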


Variable Selection Measure: Gini

Gini index of impurity – traditional statistical measure – CART

Measure how often a randomly chosen observation is incorrectly classified if it were randomly classified in proportion to the actual classes.

Calculated as the sum of the probability of each observation being chosen times the probability of incorrect classification, equivalently:

$$I_G(p, n) = 1 - \left(q^2 + (1 - q)^2\right), \quad \text{where } q = \frac{p}{p + n}$$

As with Entropy, the Gini measure is maximal when the classes are equally distributed and minimal when all observations are in one class or the other.
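A minimal sketch of the two-class Gini measure in R. The function name gini() is ours, chosen for illustration, and the arguments are assumed to be the counts of each class rather than proportions.

```r
# Gini impurity of a two-class sample with p positives and n negatives.
# gini() is a hypothetical helper written for illustration only.
gini <- function(p, n) {
  q <- p / (p + n)            # proportion of positives
  1 - (q^2 + (1 - q)^2)
}

gini(5, 5)    # classes equally distributed: 0.5, the two-class maximum
gini(10, 0)   # all observations in one class: 0
```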


Variable Selection Measure

[Figure: the Info and Gini measures plotted against the Proportion of Positives, both axes running from 0 to 1.]


Information Gain

Now use variable A to partition S into v cells: $\{S_1, S_2, \ldots, S_v\}$

If $S_i$ contains $p_i$ examples of P and $n_i$ examples of N, the information now needed to classify objects in all subtrees $S_i$ is:

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

So, the information gained by branching on A is:

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$

So choose the variable A which results in the greatest gain in information.
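The gain calculation can be sketched in R as follows. The helper names info() and gain() are ours. The counts are an illustrative three-way split of 9 positives and 5 negatives, not taken from the workshop data.

```r
# Entropy of a two-class sample (hypothetical helper, as in the formula for I).
info <- function(p, n) {
  probs <- c(p, n) / (p + n)
  probs <- probs[probs > 0]   # convention: 0 * log2(0) contributes 0
  -sum(probs * log2(probs))
}

# Information gained by partitioning on a variable A.
# p_i and n_i hold the counts of P and N in each cell S_i.
gain <- function(p_i, n_i) {
  p <- sum(p_i)
  n <- sum(n_i)
  cell_weight <- (p_i + n_i) / (p + n)
  expected <- sum(cell_weight * mapply(info, p_i, n_i))   # E(A)
  info(p, n) - expected                                   # Gain(A)
}

# Illustrative split into three cells: (2, 3), (4, 0) and (3, 2).
g <- gain(p_i = c(2, 4, 3), n_i = c(3, 0, 2))
g   # roughly 0.247
```

A variable that leaves the data in a single cell gains nothing, since E(A) then equals I(p, n).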


Predictive Data Mining: Decision Trees Building Decision Trees

Startup Rattle

library(rattle)

rattle()


Load Example Weather Dataset

Click on the Execute button and an example dataset is offered.

Click on Yes to load the weather dataset.


Summary of the Weather Dataset

A summary of the weather dataset is displayed.


Model Tab — Decision Tree

Click on the Model tab to display the modelling options.


Build Tree to Predict RainTomorrow

Decision Tree is the default model type—simply click Execute.


Decision Tree Predicting RainTomorrow

Click the Draw button to display a tree (Settings → Advanced Graphics).


Evaluate Decision Tree

Click Evaluate tab—options to evaluate model performance.


Evaluate Decision Tree—Error Matrix

Click Execute to display a simple error matrix.

Identify the True/False Positives/Negatives.


Decision Tree Risk Chart

Click the Risk type and then Execute.


Decision Tree ROC Curve

Click the ROC type and then Execute.


Score a Dataset

Click the Score type to score a new dataset using model.


Log of R Commands

Click the Log tab for a history of all your interactions.

Save the log contents as a script to repeat what we did.


Log of R Commands—rpart()

Here we see the call to rpart() to build the model.

Click on the Export button to save the script to file.


Help → Model → Tree

Rattle provides some basic help—click Yes for R help.


Predictive Data Mining: Decision Trees Doing It In R

Weather Dataset - Inputs

ds <- weather

head(ds, 4)

## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine

## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3

## 2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7

## 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3

## 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1

....

summary(ds[c(3:5,23)])

## MinTemp MaxTemp Rainfall RISK_MM

## Min. :-5.30 Min. : 7.6 Min. : 0.00 Min. : 0.00

## 1st Qu.: 2.30 1st Qu.:15.0 1st Qu.: 0.00 1st Qu.: 0.00

## Median : 7.45 Median :19.6 Median : 0.00 Median : 0.00

## Mean : 7.27 Mean :20.6 Mean : 1.43 Mean : 1.43

....


Weather Dataset - Target

target <- "RainTomorrow"

summary(ds[target])

## RainTomorrow

## No :300

## Yes: 66

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(vars <- names(ds)[-c(1, 2, 23)])

## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"

## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"

## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"

## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"

## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainToday"

## [21] "RainTomorrow"


Simple Train/Test Paradigm

set.seed(1421)

train <- c(sample(1:nrow(ds), 0.70*nrow(ds))) # Training dataset

head(train)

## [1] 288 298 363 107 70 232

length(train)

## [1] 256

test <- setdiff(1:nrow(ds), train) # Testing dataset

length(test)

## [1] 110
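The row counts above follow from how sample() treats a fractional size: 0.70 * 366 is 256.2, which is truncated to 256. A minimal sketch, using only the row count (366) rather than the dataset itself:

```r
nobs <- 366                             # number of rows in the weather dataset

set.seed(1421)
train <- sample(nobs, 0.70 * nobs)      # fractional size 256.2 is truncated to 256
test  <- setdiff(seq_len(nobs), train)  # every row not used for training

length(train)                           # 256
length(test)                            # 110
length(intersect(train, test))          # 0: the two sets do not overlap
```

Building test with setdiff() ensures that no observation is used for both training and testing.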


Display the Model

library(rpart) # Provides rpart() for recursive partitioning

model <- rpart(form, ds[train, vars])

model

## n= 256

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 256 44 No (0.82812 0.17188)

## 2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)

## 4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)

## 8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *

## 9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)

## 18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *

## 19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *

....

Notice the legend to help interpret the tree.


Performance on Test Dataset

The predict() function is used to score new data.

head(predict(model, ds[test,], type="class"))

## 2 4 6 8 11 12

## No No No No No No

## Levels: No Yes

table(predict(model, ds[test,], type="class"), ds[test, target])

##

## No Yes

## No 77 14

## Yes 11 8
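The usual performance measures can be read off this table. The sketch below re-enters the four counts by hand, with rows as predictions and columns as actual outcomes, matching the table() call above. The names cm, accuracy, recall and precision are ours.

```r
# Error matrix from the test dataset: rows are predicted, columns are actual.
cm <- matrix(c(77, 11, 14, 8), nrow = 2,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual    = c("No", "Yes")))

accuracy  <- sum(diag(cm)) / sum(cm)              # (77 + 8) / 110
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # 8 of the 22 actual Yes days
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # 8 of the 19 predicted Yes days

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
```

The overall accuracy of about 77% hides a weak recall: the tree finds only 8 of the 22 days on which it actually rained.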


Example DTree Plot using Rattle

[Figure: the decision tree drawn by Rattle. Splits: Humidity3pm < 60; WindGustSpeed < 64; Cloud3pm < 6.5; Temp3pm < 26; Pressure3pm >= 1015. Each node shows the predicted class, the class probabilities, and the percentage of observations.]


An R Scripting Hint

Notice the use of variables ds, target, vars.

Change these variables, and the remaining script is unchanged.

Simplifies script writing and reuse of scripts.

ds <- iris

target <- "Species"

vars <- names(ds)

Then repeat the rest of the script, without change.


An R Scripting Hint — Unchanged Code

This code remains the same to build the decision tree.

form <- formula(paste(target, "~ ."))

train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))

test <- setdiff(1:nrow(ds), train)

model <- rpart(form, ds[train, vars])

model

## n= 105

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 105 69 setosa (0.34286 0.32381 0.33333)

## 2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *

## 3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)

## 6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *

## 7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *


An R Scripting Hint — Unchanged Code

Similarly for the predictions.

head(predict(model, ds[test,], type="class"))

## 3 8 9 10 11 12

## setosa setosa setosa setosa setosa setosa

## Levels: setosa versicolor virginica

table(predict(model, ds[test,], type="class"), ds[test, target])

##

## setosa versicolor virginica

## setosa 14 0 0

## versicolor 0 15 4

## virginica 0 1 11


Predictive Data Mining: Decision Trees Summary

Summary

Most widely deployed machine learning algorithm.

Simple idea, powerful learner.

Available in R through the rpart package.

Related packages include party, Cubist, C50, RWeka (J48).


Predictive Data Mining: Ensembles


Building Multiple Models

General idea developed in Multiple Inductive Learning algorithm (Williams 1987).

Ideas were developed (ACJ 1987, PhD 1990) in the context of:

observe that variable selection methods don’t discriminate;

so build multiple decision trees;

then combine into a single model.

Basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation

Two approaches covered: Boosting and Random Forests.

Meta learners.


Predictive Data Mining: Ensembles Boosting

Boosting Algorithms

Basic idea: boost observations that are “hard to model.”

Algorithm: iteratively build weak models using a poor learner:

Build an initial model;

Identify mis-classified cases in the training dataset;

Boost (over-represent) training observations modelled incorrectly;

Build a new model on the boosted training dataset;

Repeat.

The result is an ensemble of weighted models.

Best off-the-shelf model builder. (Leo Breiman)
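The steps above can be sketched directly in R. This is a hand-rolled illustration of discrete AdaBoost with decision stumps from rpart, not the implementation used by the ada package. The two-class subset of iris, the ten iterations, and all variable names are our assumptions.

```r
library(rpart)

# Two-class problem from the built-in iris data (our choice, for illustration).
d <- droplevels(subset(iris, Species != "setosa"))
y <- ifelse(d$Species == "virginica", 1, -1)   # code the classes as +1 / -1
n <- nrow(d)

M      <- 10                  # number of boosting iterations (assumed)
w      <- rep(1 / n, n)       # start with equal observation weights
alpha  <- numeric(M)          # weight given to each weak model
stumps <- vector("list", M)

for (m in seq_len(M)) {
  # Build a weak model: a one-split tree fitted to the weighted data.
  stumps[[m]] <- rpart(Species ~ ., data = d, weights = w, method = "class",
                       control = rpart.control(maxdepth = 1, cp = -1,
                                               minsplit = 2, xval = 0))
  pred <- ifelse(predict(stumps[[m]], d, type = "class") == "virginica", 1, -1)
  miss <- as.numeric(pred != y)           # identify misclassified cases

  err <- sum(w * miss) / sum(w)           # weighted error of this model
  err <- min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
  alpha[m] <- log((1 - err) / err)

  w <- w * exp(alpha[m] * miss)           # boost the misclassified cases
  w <- w / sum(w)
}

# The ensemble: a weighted vote over all weak models.
votes    <- sapply(stumps, function(s)
  ifelse(predict(s, d, type = "class") == "virginica", 1, -1))
score    <- as.vector(votes %*% alpha)
ensemble <- ifelse(score > 0, 1, -1)
accuracy <- mean(ensemble == y)           # accuracy on the training data
```

Each pass raises the weight of the observations the previous stump got wrong, so the next stump concentrates on them. The accuracy here is measured on the training data only, so it says nothing about generalisation.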


Example: Ada on Weather Data

head(weather[c(1:5, 23, 24)], 3)

## Date Location MinTemp MaxTemp Rainfall RISK_MM...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.6...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 3.6...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 39.8...

....

set.seed(42)

train <- sample(1:nrow(weather), 0.7 * nrow(weather))

library(ada) # Provides ada() for boosting

(m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))

## Call:

## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])

##

## Loss: exponential Method: discrete Iteration: 50

....


Example: Error Rate

Notice error rate decreases quickly then flattens.

plot(m)

[Figure: Training Error. Error (about 0.06 to 0.14) plotted against Iteration 1 to 50.]


Example: Variable Importance

Helps understand the knowledge captured.

varplot(m)

[Figure: Variable Importance Plot. Score (about 0.04 to 0.10) for each of the 20 input variables.]


Example: Sample Trees

There are 50 trees in all. Here are the first three.

fancyRpartPlot(m$model$trees[[1]])

fancyRpartPlot(m$model$trees[[2]])

fancyRpartPlot(m$model$trees[[3]])

[Figure: the first three trees, drawn by Rattle. Tree 1 splits on Cloud3pm < 7.5, Pressure3pm >= 1012 and Humidity3pm < 42; tree 2 on Pressure3pm >= 1012 and Sunshine >= 11; tree 3 on Pressure3pm >= 1012 and MaxTemp >= 27.]


Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]

actual <- weather[-train,]$RainTomorrow

risks <- weather[-train,]$RISK_MM

riskchart(predicted, actual, risks)

[Figure: risk chart for the boosted model. Performance (%) against Caseload (%), showing Recall (88%), Risk (94%) and Precision, with Lift and Risk Scores on the secondary axes.]


Example Applications

ATO Application: What life events affect compliance?

First application of the technology: 1995

Decision Stumps: Age > NN; Change in Marital Status

Boosted Neural Networks

OCR using neural networks as base learners

Drucker, Schapire, Simard, 1993


Summary

1 Boosting is implemented in R in the ada package

2 AdaBoost uses $e^{-m}$; LogitBoost uses $\log(1 + e^{-m})$; Doom II uses $1 - \tanh(m)$

3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost)

4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.

5 Can be proved to converge to a perfect model on the training data if the learners are always better than chance.
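The three loss functions in item 2 can be compared numerically. In this sketch we assume that m denotes the margin, positive when an observation is classified correctly and negative otherwise. The slide does not define m, so treat that reading as our assumption. The column names are ours.

```r
# Evaluate each loss over a range of margins (m is assumed to be the margin).
m <- seq(-2, 2, by = 0.5)

losses <- data.frame(
  margin      = m,
  exponential = exp(-m),           # AdaBoost
  logistic    = log(1 + exp(-m)),  # LogitBoost
  doom2       = 1 - tanh(m)        # Doom II
)

round(losses, 3)
```

All three losses fall as the margin grows. For badly misclassified cases the exponential loss grows fastest, while 1 - tanh(m) stays below 2.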


Predictive Data Mining: Ensembles Random Forests

Random Forests

Original idea from Leo Breiman and Adele Cutler.

The name is licensed to Salford Systems!

Hence, R package is randomForest.

Typically presented in context of decision trees.

Random Multinomial Logit uses multiple multinomial logit models.


Random Forests

Build many decision trees (e.g., 500).

For each tree:

Select a random subset of the training set (N);

Choose different subsets of variables for each node of the decision tree (m << M);

Build the tree without pruning (i.e., overfit)

Classify a new entity using every decision tree:

Each tree “votes” for the entity.

The decision with the largest number of votes wins!

The proportion of votes is the resulting score.
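The recipe above can be approximated with rpart alone. This is a simplified sketch, not the randomForest algorithm: it chooses one random subset of variables per tree rather than per node. The iris data, the 25 trees, and all names are our assumptions.

```r
library(rpart)

set.seed(42)
ds     <- iris
target <- "Species"
inputs <- setdiff(names(ds), target)

ntree <- 25   # number of trees (assumed; randomForest defaults to 500)
mtry  <- 2    # number of variables offered to each tree (assumed)

forest <- lapply(seq_len(ntree), function(i) {
  rows <- sample(nrow(ds), replace = TRUE)   # random sample of the training set
  vars <- sample(inputs, mtry)               # random subset of the variables
  rpart(reformulate(vars, response = target), data = ds[rows, ],
        control = rpart.control(cp = 0, minsplit = 2, xval = 0))  # no pruning
})

# Every tree votes on every observation.
votes <- sapply(forest, function(tree)
  as.character(predict(tree, ds, type = "class")))

# Score: the proportion of trees voting for each class.
lev   <- levels(ds[[target]])
score <- sapply(lev, function(l) rowMeans(votes == l))

# Decision: the class with the largest number of votes.
decision <- lev[max.col(score, ties.method = "first")]
mean(decision == ds[[target]])               # agreement on the training data
```

Each tree overfits its own sample, but the errors differ from tree to tree and tend to cancel in the vote.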


Example: RF on Weather Data

library(randomForest) # Provides randomForest() and na.roughfix()

set.seed(42)

(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],

na.action=na.roughfix,

importance=TRUE))

##

## Call:

## randomForest(formula=RainTomorrow ~ ., data=weath...

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 4

##

## OOB estimate of error rate: 13.67%

## Confusion matrix:

## No Yes class.error

## No 211 4 0.0186

## Yes 31 10 0.7561


Example: Error Rate

Error rate decreases quickly then flattens over the 500 trees.

plot(m)

[Figure: plot of m. Error (0.0 to 0.8) plotted against the number of trees (0 to 500).]


Example: Variable Importance

Helps understand the knowledge captured.

varImpPlot(m, main="Variable Importance")

[Figure: Variable Importance. Two panels rank the 20 input variables, one by MeanDecreaseAccuracy (0 to 15) and one by MeanDecreaseGini (0 to 8).]


Example: Sample Trees

There are 500 trees in all. Here are some rules from the first tree.

## Random Forest Model 1

##

## ------------------------------------------------------...

## Tree 1 Rule 1 Node 30 Decision No

##

## 1: Evaporation <= 9

## 2: Humidity3pm <= 71

## 3: Cloud3pm <= 2.5

## 4: WindDir9am IN ("NNE")

## 5: Sunshine <= 10.25

## 6: Temp3pm <= 17.55

## ------------------------------------------------------...

## Tree 1 Rule 2 Node 31 Decision Yes

##

## 1: Evaporation <= 9

## 2: Humidity3pm <= 71

....


Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]

actual <- weather[-train,]$RainTomorrow

risks <- weather[-train,]$RISK_MM

riskchart(predicted, actual, risks)

[Figure: risk chart for the random forest. Performance (%) against Caseload (%), showing Recall (92%), Risk (97%) and Precision, with Lift and Risk Scores on the secondary axes.]


Features of Random Forests: By Breiman

Most accurate of current algorithms.

Runs efficiently on large data sets.

Can handle thousands of input variables.

Gives estimates of variable importance.


Random Forests

Ensemble: Multiple models working together

Often better than a single model

Variance and bias of the model are reduced

The best available models today - accurate and robust

In daily use in very many areas of application


Moving into R and Scripting our Analyses


Data Scientists are Programmers of Data

But. . .

Data scientists are programmers of data

A GUI can only do so much

R is a powerful statistical language

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing


From GUI to CLI — Rattle’s Log Tab



Step 1: Load the Dataset

dsname <- "weather"

ds <- get(dsname)

dim(ds)

## [1] 366 24

names(ds)

## [1] "Date" "Location" "MinTemp" "...

## [5] "Rainfall" "Evaporation" "Sunshine" "...

## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "...

## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "...

....


Step 2: Observe the Data — Observations

head(ds)

## Date Location MinTemp MaxTemp Rainfall Evapora...

## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...

## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...

## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...

....

tail(ds)

## Date Location MinTemp MaxTemp Rainfall Evapo...

## 361 2008-10-26 Canberra 7.9 26.1 0 ...

## 362 2008-10-27 Canberra 9.0 30.7 0 ...

## 363 2008-10-28 Canberra 7.1 28.4 0 ...

....


Step 2: Observe the Data — Structure

str(ds)

## 'data.frame': 366 obs. of 24 variables:

## $ Date : Date, format: "2007-11-01" "2007-11-...

## $ Location : Factor w/ 46 levels "Adelaide","Alba...

## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...

## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 1...

## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...

## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...

## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....

## $ WindGustDir : Ord.factor w/ 16 levels "N"<"NNE"<"N...

## $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...

## $ WindDir9am : Ord.factor w/ 16 levels "N"<"NNE"<"N...

## $ WindDir3pm : Ord.factor w/ 16 levels "N"<"NNE"<"N...

....


Step 2: Observe the Data — Summary

summary(ds)

## Date Location MinTemp ...

## Min. :2007-11-01 Canberra :366 Min. :-5.3...

## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.3...

## Median :2008-05-01 Albany : 0 Median : 7.4...

## Mean :2008-05-01 Albury : 0 Mean : 7.2...

## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.5...

## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.9...

## (Other) : 0 ...

## Rainfall Evaporation Sunshine Wind...

## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW ...

## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW ...

## Median : 0.00 Median : 4.20 Median : 8.60 E ...

....


Step 2: Observe the Data — Variables

id <- c("Date", "Location")

target <- "RainTomorrow"

risk <- "RISK_MM"

(ignore <- union(id, risk))

## [1] "Date" "Location" "RISK_MM"

(vars <- setdiff(names(ds), ignore))

## [1] "MinTemp" "MaxTemp" "Rainfall" "...

## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "...

## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "...

## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "...

....


Step 3: Clean the Data — Remove Missing

dim(ds)

## [1] 366 24

sum(is.na(ds[vars]))

## [1] 47

ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]


Step 3: Clean the Data — Remove Missing

dim(ds)

## [1] 328 24

sum(is.na(ds[vars]))

## [1] 0
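One caution about the na.omit() idiom used above: when a dataset has no missing values, the "na.action" attribute is NULL and the negative index then selects no rows at all. The sketch below shows this on small made-up data frames and offers complete.cases() as an alternative. The data and the names toy, clean, dropped and kept are ours.

```r
# A toy dataset with missing values in rows 2 and 3 (made up for illustration).
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

omitted <- attr(na.omit(toy), "na.action")   # rows that na.omit() would drop
dropped <- toy[-omitted, ]                   # the idiom used on the slide

kept <- toy[complete.cases(toy), ]           # alternative: keep complete rows

# With no missing values the attribute is NULL, and the idiom returns no rows.
clean <- data.frame(a = 1:3, b = c("x", "y", "z"))
nrow(clean[-attr(na.omit(clean), "na.action"), ])   # 0 rows
nrow(clean[complete.cases(clean), ])                # 3 rows
```

Both approaches give the same rows when something is missing. Only complete.cases() behaves sensibly when nothing is.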


Step 3: Clean the Data—Target as Categoric

summary(ds[target])

## RainTomorrow

## Min. :0.000

## 1st Qu.:0.000

## Median :0.000

## Mean :0.183

## 3rd Qu.:0.000

## Max. :1.000

....

ds[target] <- as.factor(ds[[target]])

levels(ds[target]) <- c("No", "Yes")


Step 3: Clean the Data—Target as Categoric

summary(ds[target])

## RainTomorrow

## 0:268

## 1: 60
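The summary above still reports the levels as 0 and 1 rather than No and Yes. A likely reason is that levels(ds[target]) addresses a one-column data frame rather than the factor inside it, so the relabelling appears to have no effect. Below is a sketch of one way to set the labels, using a small made-up data frame in place of the weather data.

```r
# A stand-in for the weather data with a 0/1 target (made up for illustration).
ds     <- data.frame(RainTomorrow = c(0, 1, 0, 0, 1))
target <- "RainTomorrow"

# Convert to a factor and label the levels in one step, using [[ ]] so that
# we work on the column itself rather than on a one-column data frame.
ds[[target]] <- factor(ds[[target]], levels = c(0, 1), labels = c("No", "Yes"))

levels(ds[[target]])   # "No" "Yes"
table(ds[[target]])    # No: 3, Yes: 2
```

With single brackets, ds[target] is a data frame. With double brackets, ds[[target]] is the column vector, which is what levels() needs.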

[Figure: bar chart of count (0 to 200) by RainTomorrow (0 and 1).]


Step 4: Prepare for Modelling

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(nobs <- nrow(ds))

## [1] 328

train <- sample(nobs, 0.70*nobs)

length(train)

## [1] 229

test <- setdiff(1:nobs, train)

length(test)

## [1] 99


Step 5: Build the Model—Random Forest

library(randomForest)

model <- randomForest(form, ds[train, vars], na.action=na.omit)

model

##

## Call:

## randomForest(formula=form, data=ds[train, vars], ...

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 4

....


Step 6: Evaluate the Model—Risk Chart

pr <- predict(model, ds[test,], type="prob")[,2]

riskchart(pr, ds[test, target], ds[test, risk],

title="Random Forest - Risk Chart",

risk=risk, recall=target, thresholds=c(0.35, 0.15))

[Figure: Random Forest - Risk Chart. Performance (%) against Caseload (%), showing RainTomorrow (99%), RISK_MM (100%) and Precision, with Lift and Risk Scores on the secondary axes.]


Literate Data Mining in R


Literate Data Mining in R Motivation

Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

“Remember that research you did last year? I’ve heard there is an update on the data that you used. Can you add the new data in and repeat the same analysis?”

“Jo Bloggs did a great analysis of the company returns data just before she left. Can you get someone else to analyse the new data set using the same methods, and so produce an updated report that we can present to the Exec next week?”

“The fraud case you provided an analysis of last year has finally reached the courts. We need to ensure we have a clear trail of the data sources, the analyses performed, and the results obtained, to stand up in court. Could you document these please.”


Literate Data Mining Overview

One document to intermix the analysis, code, and results

Authors productive with narrative and code in one document

Sweave (Leisch 2002) and now KnitR (Xie 2011)

Embed R code into LaTeX documents for typesetting

KnitR also supports publishing to the web


Why Reproducible Data Mining?

Automatically regenerate documents when code, data, or assumptions change.

Eliminate errors that occur when transcribing results into documents.

Record the context for the analysis and decisions made about the type of analysis to perform in the one place.

Document the processes to provide integrity for the conclusions of the analysis.

Share approach with others for peer review and for learning from each other, to engender a continuous learning environment.


Prime Objective: Trustworthy Software

Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.

John Chambers (2008), Software for Data Analysis: Programming with R


Beautiful Output by KnitR

KnitR combined with LaTeX will:

Intermix analysis and results of analysis

Automatically generate graphics and tables

Support reproducible and transparent analysis

Produce the best looking reports.


Beautiful Output by Default

The reader wants to read the document, and to do so easily!

Code highlighting is done automatically

Default theme is carefully designed

Many other themes are available

R Code is “properly” reformatted

Analyses (Graphs and Tables) automatically included.


Literate Data Mining in R Using RStudio

Supporting Technology

A suite of Free and Open Source Software (FLOSS)

RStudio: Creating, managing, compiling documents

LaTeX: Markup language for typesetting a document

R: Statistical analysis language

KnitR: Integrator of typesetting and analysis


Using RStudio

Simplified interaction with R, LaTeX, and KnitR

Executes R code one line at a time

Formats LaTeX documents and provides spell checking

Single-click compile to PDF, with synchronised views

Demonstrate: Startup and explore RStudio.


Introducing LaTeX

A text markup language rather than a WYSIWYG editor.

Based on TeX from 1977: very stable and powerful.

LaTeX is an easier-to-use macro package built on TeX.

Ensures consistent style (layout, fonts, tables, maths, etc.)

Automatic indexes, footnotes and references.

Documents are well structured and are clear text.

Has a learning curve.


Basic LaTeX Usage

\documentclass{article}
\begin{document}
\end{document}

Demonstrate: Create a new Sweave document in RStudio


Structures

\documentclass{article}
\begin{document}
\section{Introduction}
...
\subsection{Concepts}
...
\end{document}


Formats

\documentclass{article}
\begin{document}
\begin{itemize}
\item ABC
\item DEF
\end{itemize}
This is \textbf{bold} text or \textit{italic} text, ...
\end{document}


RStudio Support for LaTeX

RStudio provides excellent support for working with LaTeX documents

Helps to avoid having to know too much about LaTeX

Best illustrated through a demonstration

Format menu

Section commands
Font commands
List commands
Verbatim/Block commands

Spell Checker

Compile PDF

Demonstrate: Start a new document, add contents, format to PDF.


Literate Data Mining in R Incorporating R Code

Incorporating R Code

We insert R code in a Chunk starting with << >>=

We terminate the Chunk with @

Save the LaTeX file with the extension .Rnw

This Chunk

<<simple_example>>=
x <- sum(1:10)
x
@

Produces

x <- sum(1:10)
x
## [1] 55

Demonstrate: Do this in RStudio
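
Outside RStudio, the same chunk can be processed from an R session. The sketch below assumes the knitr package is installed and passes the chunk to knit() as a string through its text argument; the variable names rnw and tex are illustrative only.

```r
library(knitr)  # assumes the knitr package is installed

# A tiny Rnw fragment: one named chunk, opened by <<...>>= and closed by @.
rnw <- c("<<simple_example>>=",
         "x <- sum(1:10)",
         "x",
         "@")

# knit(text = ...) evaluates the chunk and returns the generated LaTeX.
tex <- knit(text = rnw, quiet = TRUE)
cat(tex)
```

The returned LaTeX contains the echoed code followed by the evaluated result, which is what ends up typeset in the PDF.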


Making You Look Good

<<format_example>>=
for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}
@

for(i in 1:5)
{j <- cos(sin(i)*i^2)+3
print(j-5)
}
## [1] -1.334
## [1] -2.88
## [1] -1.704
## [1] -1.103
....


R Within the Text

Include information about data within the narrative.

We can do that with \Sexpr{...}.

Our dataset has \Sexpr{nrow(ds)} observations of \Sexpr{ncol(ds)} variables.

Becomes

Our dataset has 328 observations of 24 variables.

Better Still: \Sexpr{format(nrow(ds), big.mark=",")}

Our dataset has 328 observations of 24 variables.
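
With only 328 observations the big.mark argument makes no visible difference. A short base R sketch shows its effect on a larger count; the value 145460 is a made-up row count used purely for illustration.

```r
n_small <- 328
n_large <- 145460   # hypothetical row count, chosen only to show the separator

small_txt <- format(n_small, big.mark = ",")   # "328"
large_txt <- format(n_large, big.mark = ",")   # "145,460"

# The same call is what would sit inside \Sexpr{...} in the narrative.
msg <- paste("Our dataset has", large_txt, "observations.")
print(msg)
```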


A Simple Table

library(xtable)
obs <- sample(1:nrow(weatherAUS), 8)
vars <- 2:6
xtable(weatherAUS[obs, vars])

       Location      MinTemp  MaxTemp  Rainfall  Evaporation
79696  AliceSprings     5.90    21.70      0.00         4.00
32428  Ballarat         2.00    10.60      0.20
58707  MountGambier     8.50    19.70      2.20         1.60
22420  WaggaWagga       1.30    17.40      0.00         3.60
50382  Cairns          24.00    31.80      0.00         4.60
 1441  Albury          11.20    32.20      0.00
78992  AliceSprings    13.80    30.90      0.00         5.80
66576  PearceRAAF       5.60    22.90      0.00


Table: Exclude Row Names

print(xtable(weatherAUS[obs, vars]),
      include.rownames=FALSE)

Location      MinTemp  MaxTemp  Rainfall  Evaporation
AliceSprings     5.90    21.70      0.00         4.00
Ballarat         2.00    10.60      0.20
MountGambier     8.50    19.70      2.20         1.60
WaggaWagga       1.30    17.40      0.00         3.60
Cairns          24.00    31.80      0.00         4.60
Albury          11.20    32.20      0.00
AliceSprings    13.80    30.90      0.00         5.80
PearceRAAF       5.60    22.90      0.00


Table: Limit Number of Digits

print(xtable(weatherAUS[obs, vars],
             digits=1),
      include.rownames=FALSE)

Location      MinTemp  MaxTemp  Rainfall  Evaporation
AliceSprings      5.9     21.7       0.0          4.0
Ballarat          2.0     10.6       0.2
MountGambier      8.5     19.7       2.2          1.6
WaggaWagga        1.3     17.4       0.0          3.6
Cairns           24.0     31.8       0.0          4.6
Albury           11.2     32.2       0.0
AliceSprings     13.8     30.9       0.0          5.8
PearceRAAF        5.6     22.9       0.0


Table: Tiny Font

vars <- 2:8
print(xtable(weatherAUS[obs, vars],
             digits=0),
      size="tiny",
      include.rownames=FALSE)

Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
AliceSprings        6       22         0            4        11  SE
Ballarat            2       11         0                         W
MountGambier        8       20         2            2         3  SW
WaggaWagga          1       17         0            4        11  SSW
Cairns             24       32         0            5         9  NE
Albury             11       32         0                         NNE
AliceSprings       14       31         0            6        11  N
PearceRAAF          6       23         0                     11  S


Table: Column Alignment

vars <- 2:8
print(xtable(weatherAUS[obs, vars],
             digits=0,
             align="rlrrrrrr"),
      size="tiny")

       Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
79696  AliceSprings        6       22         0            4        11           SE
32428  Ballarat            2       11         0                                   W
58707  MountGambier        8       20         2            2         3           SW
22420  WaggaWagga          1       17         0            4        11          SSW
50382  Cairns             24       32         0            5         9           NE
 1441  Albury             11       32         0                                 NNE
78992  AliceSprings       14       31         0            6        11            N
66576  PearceRAAF          6       23         0                     11            S


Table: Caption

print(xtable(weatherAUS[obs, vars],
             digits=1,
             caption="This is the table caption."),
      size="tiny")

       Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
79696  AliceSprings      5.9     21.7       0.0          4.0      10.8  SE
32428  Ballarat          2.0     10.6       0.2                         W
58707  MountGambier      8.5     19.7       2.2          1.6       2.8  SW
22420  WaggaWagga        1.3     17.4       0.0          3.6      10.8  SSW
50382  Cairns           24.0     31.8       0.0          4.6       8.8  NE
 1441  Albury           11.2     32.2       0.0                         NNE
78992  AliceSprings     13.8     30.9       0.0          5.8      11.2  N
66576  PearceRAAF        5.6     22.9       0.0                   11.0  S

Table: This is the table caption.
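
The table options from the preceding slides can be combined in one call. The sketch below assumes the xtable package is installed and uses the built-in mtcars dataset as a stand-in for weatherAUS, so it runs without rattle. The set.seed() call is an addition that does not appear in the slides: without it, sample() draws different rows on every run, which works against the reproducibility goal of this section.

```r
library(xtable)  # assumes the xtable package is installed

set.seed(42)                        # fix the seed so the same rows are drawn each run
obs  <- sample(1:nrow(mtcars), 8)   # mtcars ships with R; it stands in for weatherAUS
vars <- 1:5

xt <- xtable(mtcars[obs, vars],
             digits = 1,
             caption = "Eight randomly sampled rows.")

# print.results = FALSE returns the LaTeX as a string instead of printing it.
tex <- print(xt,
             size = "tiny",
             include.rownames = FALSE,
             print.results = FALSE)
cat(tex)
```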


Plots

library(ggplot2)
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))
g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
g <- g + geom_density(alpha = 0.55)
print(g)

[Figure: density plot of Temp3pm (x-axis, 10 to 40) against density (y-axis, 0.00 to 0.20), one filled curve per Location: Canberra, Darwin, Melbourne, Sydney.]


Literate Data Mining in R Our First KnitR Document

Create a KnitR Document: New→R Sweave


Setup KnitR

We wish to use KnitR rather than the older Sweave processor

In RStudio we can configure the options to use knitr:

Select Tools→Options

Choose the Sweave group

Choose knitr for Weave Rnw files using:

The remaining defaults should be okay

Click Apply and then OK


Simple KnitR Document

Insert the following into your new KnitR document:

\title{Sample KnitR Document}
\author{Graham Williams}
\maketitle

\section*{My First Section}

This is some text that is automatically typeset
by the LaTeX processor to produce well formatted
quality output as PDF.
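
For readers working outside RStudio, the same document can be processed from an R session. The sketch below is one possible route, assuming the knitr package is installed: it wraps the content above in the \documentclass and \begin{document} skeleton shown earlier, writes it to a temporary file (the name sample.Rnw is hypothetical), and knits it to LaTeX. Producing the PDF additionally needs a LaTeX installation, so that step is left commented out.

```r
library(knitr)  # assumes the knitr package is installed

# A complete Rnw document: the slide's content inside the basic LaTeX skeleton.
rnw <- c("\\documentclass{article}",
         "\\begin{document}",
         "\\title{Sample KnitR Document}",
         "\\author{Graham Williams}",
         "\\maketitle",
         "\\section*{My First Section}",
         "This is some text that is automatically typeset",
         "by the LaTeX processor to produce well formatted",
         "quality output as PDF.",
         "\\end{document}")

rnw_file <- file.path(tempdir(), "sample.Rnw")   # hypothetical file name
tex_file <- file.path(tempdir(), "sample.tex")
writeLines(rnw, rnw_file)

# Step 1: knitr turns the .Rnw file into plain LaTeX.
knit(rnw_file, output = tex_file, quiet = TRUE)

# Step 2 (requires a LaTeX installation, so not run here):
# tools::texi2pdf(tex_file)
```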


Simple KnitR Document—Resulting PDF

Result of Compile PDF


Literate Data Mining in R Including R Commands in KnitR

KnitR: Add R Commands

R code can be used to generate results into the document:

<<echo=FALSE, message=FALSE>>=
library(rattle) # Provides the weather dataset
library(ggplot2) # Provides the qplot() function
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
@


KnitR Document With R Code


Simple KnitR Document—PDF with Plot

Result of Compile PDF


Literate Data Mining in R Basics Cheat Sheet

LaTeX Basics

\subsection*{...}     % Introduce a Sub Section
\subsubsection*{...}  % Introduce a Sub Sub Section

\textbf{...}          % Bold font
\textit{...}          % Italic font

\begin{itemize}       % A bullet list
\item ...
\item ...
\end{itemize}

Plus an extensive collection of other markup and capabilities.


KnitR Basics

echo=FALSE                    # Do not display the R code
eval=TRUE                     # Evaluate the R code
results="hide"                # Hide the results of the R commands
fig.width=10                  # Extend figure width from 7 to 10 inches
fig.height=8                  # Extend figure height from 7 to 8 inches
out.width="0.8\\textwidth"    # Fit figure 80% page width
out.height="0.5\\textheight"  # Fit figure 50% page height

Plus an extensive collection of other options.
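
Rather than repeating these options in every chunk header, defaults can be set once for the whole document. The sketch below assumes the knitr package is installed and uses its opts_chunk object; the particular values are only examples taken from the list above.

```r
library(knitr)  # assumes the knitr package is installed

# Set document-wide defaults, typically in the first chunk of the .Rnw file.
opts_chunk$set(echo = FALSE,
               fig.width = 10,
               fig.height = 8,
               out.width = "0.8\\textwidth")

# Query the current defaults.
echo_default  <- opts_chunk$get("echo")
width_default <- opts_chunk$get("fig.width")
print(echo_default)
print(width_default)
```

Options given in an individual chunk header still override these defaults for that chunk.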


Literate Data Mining in R Resources

Resources: One Page R

A new series of guides to using R.

http://onepager.togaware.com

Data Preparation template: http://onepager.togaware.com/DataO.pdf

Model Building template: http://onepager.togaware.com/ModelO.pdf

Plots using GGPlot2: http://onepager.togaware.com/GGPlot2O.pdf

Text Mining in R: http://onepager.togaware.com/TextMiningO.pdf


Resources and References

OnePageR: http://onepager.togaware.com – Tutorial Notes

Rattle: http://rattle.togaware.com

Guides: http://datamining.togaware.com

Practise: http://analystfirst.com

Book: Data Mining using Rattle/R

Chapter: Rattle and Other Tales

Paper: A Data Mining GUI for R — R Journal, Volume 1(2)


That Is All Folks Time For Questions

Thank You

This document, sourced from CUNYWorkshop.Rnw revision 315, was processed by KnitR version 1.5 of 2013-09-28 and took 11.4 seconds to process. It was generated by gjw on theano running Ubuntu 13.10 with Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz having 4 cores and 3.9GB of RAM. It completed the processing 2014-03-01 07:25:28.
