R for hadoopers

Scalable Analytics with R, Hadoop and RHadoop
Gwen Shapira, Software Engineer

Goals:
• Teach R and Hadoop basics
• Share useful R libraries
• Explain RHadoop
• Give tips on how to use RHadoop
• Have fun, kick ass

Not a Goal:
Turn you into:
• Statistician
• Machine learning expert
• R guru
• Hadoop expert


#include warning.h

Agenda

• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop

Get Started with RStudio

Basic Data Types

• String
• Number
• Boolean
• Assignment <-

R can be a nice calculator

> x <- 1
> x * 2
[1] 2
> y <- x + 3
> y
[1] 4
> log(y)
[1] 1.386294
> help(log)

Complex Data Types

• Vector
  • c, seq, rep, []
• List
• Data Frame
  • List of vectors of the same length
  • Not a matrix

Creating vectors

> v1 <- c(1,2,3,4)
> v1
[1] 1 2 3 4
> v1 * 4
[1]  4  8 12 16
> v4 <- c(1:5)
> v4
[1] 1 2 3 4 5
> v2 <- seq(2,12,by=3)
> v2
[1]  2  5  8 11
> v1 * v2
[1]  2 10 24 44
> v3 <- rep(3,4)
> v3
[1] 3 3 3 3

Accessing and filtering vectors

> v1 <- c(2,4,6,8)
> v1
[1] 2 4 6 8
> v1[2]
[1] 4
> v1[2:4]
[1] 4 6 8
> v1[-2]
[1] 2 6 8
> v1[v1>3]
[1] 4 6 8

Lists

> lst <- list(1,"x",FALSE)
> lst
[[1]]
[1] 1

[[2]]
[1] "x"

[[3]]
[1] FALSE

> lst[1]
[[1]]
[1] 1

> lst[[1]]
[1] 1

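Data Frames

books <- read.csv("~/books.csv")
books[1,]      # first row
books[,1]      # first column
books[3:4]     # columns 3 and 4

books$price                  # one column by name
books[books$price==6.99,]    # rows matching a condition
martin_price <- books[books$author_t=="George R.R. Martin",]$price
mean(martin_price)
subset(books,select=-c(id,cat,sequence_i))   # drop columns by name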
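The indexing forms above can be tried without books.csv. This is a minimal sketch that assumes a small hypothetical data frame: the column names id, author_t and price follow the slide, and the values are invented for illustration.

```r
# Hypothetical stand-in for books.csv; the values are invented for illustration.
books <- data.frame(
  id       = 1:4,
  author_t = c("George R.R. Martin", "George R.R. Martin",
               "Neal Stephenson", "Isaac Asimov"),
  price    = c(7.99, 6.99, 9.99, 6.99),
  stringsAsFactors = FALSE
)

first_row    <- books[1, ]                      # one row, all columns
id_column    <- books[, 1]                      # one column, returned as a vector
cheap_books  <- books[books$price == 6.99, ]    # rows matching a condition
martin_price <- books[books$author_t == "George R.R. Martin", ]$price
martin_mean  <- mean(martin_price)
without_id   <- subset(books, select = -c(id))  # drop a column by name

print(cheap_books)
print(martin_mean)
```

Note that a logical condition goes before the comma (it selects rows), while column selection goes after it.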

Vectorization:
Always prefer operations on entire vectors
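As a small illustration of the advice above, the sketch below computes the same result twice, once with an explicit loop and once with a single vector operation. The prices vector and the 10% discount are made up for the example.

```r
prices <- c(6.99, 7.99, 9.99, 12.50)

# Loop version: one element at a time.
discounted_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  discounted_loop[i] <- prices[i] * 0.9
}

# Vectorized version: one operation on the whole vector.
discounted_vec <- prices * 0.9

# Both give the same values; the vectorized form is shorter and usually faster.
print(all.equal(discounted_loop, discounted_vec))
```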

Functions

> sq <- function(x) { x*x }
> sq(3)
[1] 9

Note:
R is a functional programming language.
Functions are first-class objects and can be passed to other functions.
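To show what "first-class" means in practice, here is a short sketch that reuses sq from the slide. The helper twice is a hypothetical name introduced only for this example.

```r
sq <- function(x) { x * x }

# Pass a function as an argument to another function.
squares <- sapply(1:4, sq)

# A function that takes a function and returns a new function.
twice <- function(f) {
  function(x) f(f(x))
}
fourth_power <- twice(sq)

print(squares)
print(fourth_power(3))
```

The same pattern, passing a function to another function, is how map and reduce functions are later handed to mapreduce.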

Packages

Agenda

• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox.”
- Grace Hopper, early advocate of distributed computing


Hadoop in a Nutshell

Map-Reduce is the interesting bit

• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output and sort each partition
• Reduce – Apply aggregation function to all values in each partition
• Map reads input from disk
• Reduce writes output to disk
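The three phases can be imitated on a single machine in plain R. This is only a sketch of the data flow, not of how Hadoop executes it: the three input records, the word-count task and the names map_fn, partitions and reduced are all assumptions for illustration.

```r
records <- c("a b a", "b c", "a")

# Map: apply a function to each input record, emitting (key, value) pairs.
map_fn <- function(record) {
  words <- strsplit(record, " ")[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(records, map_fn))

# Shuffle & sort: partition the map output by key and order the partitions.
partitions <- split(mapped$value, mapped$key)
partitions <- partitions[order(names(partitions))]

# Reduce: apply an aggregation function to all values in each partition.
reduced <- sapply(partitions, sum)
print(reduced)
```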


Example – Sessionize clickstream


Sessionize

Identify unique “sessions” of interacting with our website

Session – for each user (IP), set of clicks that happened within 30 minutes of each other


Input – Apache Access Log Records

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326


Output – Add Session ID

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15

Overview

[Diagram: each log line goes to a Map task; the map output is grouped by IP (IP1, log lines) and passed to a Reduce task; each Reduce task emits: log line, session ID]

Map

parsedRecord = re.search('(\\d+.\\d+….', record)
IP = parsedRecord.group(1)
timestamp = parsedRecord.group(2)
print((IP, timestamp), record)

Shuffle & Sort

Partition by: IP
Sort by: timestamp

Now reduce gets:
(IP, timestamp) [record1, record2, record3….]

Reduce

sessionID = 1
curr_timestamp = getTimestamp(records[0])
foreach record in records:
    # records arrive sorted by timestamp;
    # a gap of more than 30 minutes starts a new session
    if (getTimestamp(record) - curr_timestamp > 30):
        sessionID += 1
    curr_timestamp = getTimestamp(record)
    print(record + " " + sessionID)
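Since the rest of the talk is in R, here is a sketch of the same reducer logic written with vector operations instead of a loop. It assumes the timestamps for one IP have already been parsed into a sorted POSIXct vector; the function name sessionize and the sample click times are hypothetical.

```r
# Assign session IDs to one user's clicks.
# `timestamps` is assumed to be a POSIXct vector already sorted ascending,
# which is what the shuffle-and-sort step provides.
sessionize <- function(timestamps, gap_minutes = 30) {
  if (length(timestamps) == 0) return(integer(0))
  gaps <- as.numeric(diff(timestamps), units = "mins")
  # A new session starts at the first click and after every long gap.
  cumsum(c(TRUE, gaps > gap_minutes))
}

start  <- as.POSIXct("2000-10-10 13:55:36", tz = "UTC")
clicks <- start + c(0, 10, 25, 70, 75) * 60   # offsets given in minutes
session_ids <- sessionize(clicks)
print(session_ids)
```

Here the gap between the third and fourth click is 45 minutes, so the last two clicks fall into a second session.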

Agenda

• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop

Reshape2

• Two functions:
  • Melt – wide format to long format
  • Cast – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
  • Unique identifiers
  • New column – variable name
  • New column – value
• Default – all numbers are values

Melt

> tips
  total_bill  tip    sex smoker day   time size
       16.99 1.01 Female     No Sun Dinner    2
       10.34 1.66   Male     No Sun Dinner    3
       21.01 3.50   Male     No Sun Dinner    3

> melt(tips)
     sex smoker day   time   variable value
  Female     No Sun Dinner total_bill 16.99
  Female     No Sun Dinner        tip  1.01
  Female     No Sun Dinner       size     2

Cast

> m_tips <- melt(tips)
> m_tips
     sex smoker day   time   variable value
  Female     No Sun Dinner total_bill 16.99
  Female     No Sun Dinner        tip  1.01
  Female     No Sun Dinner       size     2

> dcast(m_tips,sex+time~variable,mean)
     sex   time total_bill      tip     size
  Female Dinner   19.21308 3.002115 2.461538
  Female  Lunch   16.33914 2.582857 2.457143
    Male Dinner   21.46145 3.144839 2.701613
    Male  Lunch   18.04848 2.882121 2.363636
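To make the wide-to-long and long-to-wide idea concrete without installing reshape2, here is a base R sketch of the same two steps. It is not how reshape2 implements melt or dcast. It uses only the three sample rows shown on the Melt slide, restricted to two identifier columns, and the variable names are chosen for illustration.

```r
# Three rows from the tips sample, restricted to two identifier columns.
tips <- data.frame(
  sex        = c("Female", "Male", "Male"),
  time       = c("Dinner", "Dinner", "Dinner"),
  total_bill = c(16.99, 10.34, 21.01),
  tip        = c(1.01, 1.66, 3.50),
  stringsAsFactors = FALSE
)
id_cols      <- c("sex", "time")
measure_cols <- setdiff(names(tips), id_cols)

# "Melt": one row per (identifiers, variable, value).
molten <- do.call(rbind, lapply(measure_cols, function(m) {
  data.frame(tips[id_cols], variable = m, value = tips[[m]],
             stringsAsFactors = FALSE)
}))

# "Cast": back to wide format, aggregating each group with mean.
wide <- tapply(molten$value, list(molten$sex, molten$variable), mean)

print(molten)
print(wide)
```

With real data, the reshape2 calls on the slides do the same work in one line each.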

*Apply

• apply – apply function on rows or columns of matrix
• lapply – apply function on each item of list
  • Returns list
• sapply – like lapply, but return vector
• tapply – apply function to subsets of vector or lists
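The four functions above are all in base R. The following sketch runs each of them on tiny made-up inputs so the differences in return type are visible.

```r
m <- matrix(1:6, nrow = 2)          # 2 rows, 3 columns
row_sums <- apply(m, 1, sum)        # MARGIN = 1: apply over rows
col_sums <- apply(m, 2, sum)        # MARGIN = 2: apply over columns

words   <- list("hello", "hadoop", "r")
as_list <- lapply(words, nchar)     # always returns a list
as_vec  <- sapply(words, nchar)     # simplifies to a vector when it can

tip <- c(1.01, 1.66, 3.50, 2.00)
sex <- c("Female", "Male", "Male", "Female")
mean_tip_by_sex <- tapply(tip, sex, mean)   # apply mean to each subset

print(row_sums)
print(as_vec)
print(mean_tip_by_sex)
```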

plyr

• Split – apply – combine
• ddply – data frame to data frame
  ddply(.data, .variables, .fun = NULL, ...,
• summarize – aggregate data into new data frame
• transform – modify data frame

DDPLY Example

> ddply(tips,c("sex","time"),summarize,
+       mean=mean(tip),
+       sd=sd(tip),
+       ratio=mean(tip/total_bill)
+ )
     sex   time     mean       sd     ratio
1 Female Dinner 3.002115 1.193483 0.1693216
2 Female  Lunch 2.582857 1.075108 0.1622849
3   Male Dinner 3.144839 1.529116 0.1554065
4   Male  Lunch 2.882121 1.329017 0.1660826
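The split, apply and combine steps that ddply performs can be spelled out in base R. This sketch is an illustration of the pattern, not plyr's implementation, and the four-row tips data frame is invented for the example.

```r
tips <- data.frame(
  sex        = c("Female", "Male", "Male", "Female"),
  time       = c("Dinner", "Dinner", "Lunch", "Dinner"),
  tip        = c(1.01, 1.66, 3.50, 2.00),
  total_bill = c(16.99, 10.34, 21.01, 20.00),
  stringsAsFactors = FALSE
)

# Split: one data frame per (sex, time) combination that actually occurs.
pieces <- split(tips, list(tips$sex, tips$time), drop = TRUE)

# Apply: summarize each piece into a one-row data frame.
summaries <- lapply(pieces, function(piece) {
  data.frame(sex   = piece$sex[1],
             time  = piece$time[1],
             mean  = mean(piece$tip),
             ratio = mean(piece$tip / piece$total_bill),
             stringsAsFactors = FALSE)
})

# Combine: stack the per-group rows back into one data frame.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```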

Agenda

• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop

RHadoop Projects

• RMR
• RHDFS
• RHBase
• (new) PlyRMR

Most Important:
RMR does not parallelize algorithms.

It allows you to implement MapReduce in R. Efficiently. That’s it.


What does that mean?

• Use RMR if you can break your problem down to small pieces and apply the algorithm there

• Use commercial R+Hadoop if you need a parallel version of a well-known algorithm

• Good fit: Fit piecewise regression model for each county in the US

• Bad fit: Fit piecewise regression model for the entire US population

• Bad fit: Logistic regression


Use-case examples – Good or Bad?

1. Model power consumption per household to determine if incentive programs work

2. Aggregate corn yield per 10x10 portion of field to determine best seeds to use

3. Create churn models for service subscribers and determine who is most likely to cancel

4. Determine correlation between device restarts and support calls

Second Most Important:
RMR requires R, RMR and all libraries you’ll use to be installed on all nodes and accessible by the Hadoop user

RMR is different from Hadoop Streaming.

RMR mapper input:
Key, [List of Records]

This is so we can use vector operations


How to RMRify a Problem

In more detail…

• Mappers get a list of values
• You need to process each one independently
• But do it for all lines at once.
• Reducers work normally
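The points above can be tried in plain R, without rmr2 or a cluster. In this sketch, v stands in for the batch of input lines a mapper would receive in one call; the sample lines and variable names are assumptions for illustration.

```r
# `v` stands in for the vector of input lines a mapper receives in one call.
v <- c("hello world", "don't worry be happy")

# Line-at-a-time thinking: loop over the lines and grow a result.
words_loop <- character(0)
for (line in v) {
  words_loop <- c(words_loop, strsplit(line, " ")[[1]])
}

# Vectorized thinking: strsplit accepts the whole vector at once
# and returns one character vector per line.
words_vec <- unlist(strsplit(v, " "))

print(identical(words_loop, words_vec))
```

Each line is still processed independently, but a single call handles all of them.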

Demo 6

library(rmr2)

# Try the word splitting locally first
t <- list("hello world","don't worry be happy")
unlist(sapply(t,function (x) {strsplit(x," ")}))

# Map: emit (word, 1) for every word in the batch of lines
wc.map <- function(k,v) {
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
  keyval(ret_k,1)
}

# Reduce: sum the counts for each word
wc.reduce <- function(k,v) {
  keyval(k,sum(v))
}

mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
          output="~/wc.json",
          input.format="text",
          output.format="json",
          map=wc.map,
          reduce=wc.reduce)

Cheating in MapReduce:
Do everything possible to have map-only jobs

Avg Tips per Person – Naïve Input

Gwen 1
Jeff 2
Leon 1
Gwen 2.5
Leon 3
Jeff 1
Gwen 1
Gwen 2
Jeff 1.5

Avg Tips per Person - Naive

# Map: emit (name, tip) pairs
avg.map <- function(k,v) {
  keyval(v$V1,v$V2)
}

# Reduce: average all tips for one name
avg.reduce <- function(k,v) {
  keyval(k,mean(v))
}

mapreduce(input="~/hadoop-recipes/data/tip1.txt",
          output="~/avg.txt",
          input.format=make.input.format("csv"),
          output.format="text",
          map=avg.map,
          reduce=avg.reduce)

Avg Tips per Person – Awesome Input

Gwen 1,2.5,1,2
Jeff 2,1,1.5
Leon 1,3

Avg Tips per Person - Optimized

# Map only: each record already holds all tips for one person
avg2.map <- function(k,v) {
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
}

mapreduce(input="~/hadoop-recipes/data/tip2.txt",
          output="~/avg2.txt",
          input.format=make.input.format("csv",sep=","),
          output.format="text",
          map=avg2.map)
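The map-only idea can be checked in plain R before running it on a cluster. This sketch does not use rmr2 and makes its own assumption about parsing: each line has the form shown on the input slide, a name, one space, then comma-separated tips. The variable names are hypothetical.

```r
lines <- c("Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3")

# Split each line into a name and a comma-separated list of tips.
# Assumes exactly one space, separating the name from the tips.
parts <- strsplit(lines, " ")
name  <- sapply(parts, `[`, 1)
tips  <- sapply(parts, `[`, 2)

# The whole average can be computed inside the "map" step,
# because every line already holds all the values for its key.
avg <- sapply(strsplit(tips, ","), function(x) mean(as.numeric(x)))
names(avg) <- name
print(avg)
```

No reduce step is needed because no value for a given person lives on another line.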

Few Final RMR Tips

• Backend = “local” has files as input and output
• Backend = “hadoop” uses HDFS directories
• In “hadoop” mode, print(X) inside the mapper will fail the job.
• Use: cat("ERROR!", file = stderr())
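Following the last tip, here is a small sketch of a helper that reports a problem on standard error instead of printing it. The function name safe_as_number and its input are hypothetical; only cat(..., file = stderr()) comes from the slide.

```r
# Hypothetical mapper-style helper that reports unparseable values
# on standard error rather than standard output.
safe_as_number <- function(x) {
  n <- suppressWarnings(as.numeric(x))
  if (any(is.na(n))) {
    cat("ERROR! could not parse:", x[is.na(n)], "\n", file = stderr())
  }
  n
}

parsed <- safe_as_number(c("1", "2.5", "oops"))
```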

Recommended Reading

• http://cran.r-project.org/doc/manuals/R-intro.html
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
• http://had.co.nz/reshape/paper-dsc2005.pdf
• http://seananderson.ca/2013/12/01/plyr.html
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
• http://cran.r-project.org/web/packages/data.table/index.html
