Scalable Analytics with R, Hadoop and RHadoop
Gwen Shapira, Software Engineer
@gwenshap

R for hadoopers


DESCRIPTION

R + Hadoop presentation. From OSCON and Seattle meetup



Goals:

• Teach R basics and Hadoop basics
• Share useful R libraries
• Explain RHadoop
• Give tips on how to use RHadoop
• Have fun, kick ass


Not a Goal:

Turn you into:
• Statistician
• Machine learning expert
• R guru
• Hadoop expert


#include warning.h


Agenda

• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop


Get Started with RStudio


Basic Data Types

• String
• Number
• Boolean
• Assignment <-
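
A small sketch of the types listed above, with made-up values. `class()` reports the name R itself uses for each type:

```r
# Hypothetical values, one per basic type on the slide
s <- "hadoop"   # string  (R calls this "character")
n <- 42.5       # number  (R calls this "numeric")
b <- TRUE       # boolean (R calls this "logical")

types <- c(class(s), class(n), class(b))
print(types)
```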


R can be a nice calculator

> x <- 1
> x * 2
[1] 2
> y <- x + 3
> y
[1] 4
> log(y)
[1] 1.386294
> help(log)


Complex Data Types

• Vector
  • c, seq, rep, []
• List
• Data Frame
  • Lists of vectors of same length
  • Not a matrix
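
To make the last two bullets concrete, here is a minimal sketch with an invented data frame (the names `df`, `name` and `tip` are illustrative only). It checks that a data frame is a list of equal-length columns, that it is not a matrix, and that columns can have different types:

```r
# Hypothetical data frame: each column is a vector, all of the same length
df <- data.frame(name = c("Gwen", "Jeff", "Leon"),
                 tip  = c(1, 2, 1.5),
                 stringsAsFactors = FALSE)

is_list   <- is.list(df)        # a data frame is a list of columns
is_matrix <- is.matrix(df)      # ...and not a matrix
col_types <- sapply(df, class)  # columns may have different types
print(col_types)
```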


Creating vectors

> v1 <- c(1,2,3,4)
> v1
[1] 1 2 3 4
> v1 * 4
[1]  4  8 12 16
> v4 <- c(1:5)
> v4
[1] 1 2 3 4 5
> v2 <- seq(2,12,by=3)
> v2
[1]  2  5  8 11
> v1 * v2
[1]  2 10 24 44
> v3 <- rep(3,4)
> v3
[1] 3 3 3 3


Accessing and filtering vectors

> v1 <- c(2,4,6,8)
> v1
[1] 2 4 6 8
> v1[2]
[1] 4
> v1[2:4]
[1] 4 6 8
> v1[-2]
[1] 2 6 8
> v1[v1>3]
[1] 4 6 8


Lists

> lst <- list(1,"x",FALSE)
> lst
[[1]]
[1] 1

[[2]]
[1] "x"

[[3]]
[1] FALSE

> lst[1]
[[1]]
[1] 1

> lst[[1]]
[1] 1


Data Frames

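books <- read.csv("~/books.csv")
books[1,]     # first row
books[,1]     # first column
books[3:4]    # columns 3 and 4

books$price                   # one column, by name
books[books$price==6.99,]     # rows matching a condition
martin_price <- books[books$author_t=="George R.R. Martin",]$price
mean(martin_price)
subset(books,select=-c(id,cat,sequence_i))   # drop columns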


Vectorization: always prefer operations on entire vectors
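
A minimal sketch of the difference, using an invented `prices` vector and an arbitrary 1.1 multiplier. Both versions give the same result; the vectorized one is the form to prefer:

```r
# Hypothetical vector of prices
prices <- c(6.99, 12.50, 8.25, 20.00)

# Element-by-element loop (works, but not idiomatic R)
taxed_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  taxed_loop[i] <- prices[i] * 1.1
}

# Vectorized: one operation over the entire vector
taxed_vec <- prices * 1.1

print(all.equal(taxed_loop, taxed_vec))
```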


Functions

> sq <- function(x) { x*x }
> sq(3)
[1] 9

Note: R is a functional programming language. Functions are first-class objects and can be passed to other functions.
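
As a sketch of what "passed to other functions" looks like in practice: `apply_twice` below is a hypothetical helper written for this example, while `sapply` is a built-in that takes a function as an argument.

```r
sq <- function(x) { x * x }

# A function that takes another function as an argument (hypothetical helper)
apply_twice <- function(f, x) { f(f(x)) }

r1 <- apply_twice(sq, 3)      # sq(sq(3))
r2 <- sapply(c(1, 2, 3), sq)  # pass sq to a built-in higher-order function
print(r1)
print(r2)
```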


packages


Agenda

• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop


“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox”

– Grace Hopper, early advocate of distributed computing


Hadoop in a Nutshell


Map-Reduce is the interesting bit

• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output and sort each partition
• Reduce – Apply aggregation function to all values in each partition
• Map reads input from disk
• Reduce writes output to disk
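
The three phases can be imitated on a single machine in plain R. This is only a sketch of the idea, with no Hadoop involved; the input records and the names `map_fn`, `mapped`, `partitions` and `counts` are invented for the example:

```r
# Hypothetical input records
records <- c("a b a", "b c", "a")

# Map: apply a function to each input record, emitting (key, value) pairs
map_fn <- function(record) {
  words <- strsplit(record, " ")[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(records, map_fn))

# Shuffle & Sort: partition the map output by key (split orders the keys)
partitions <- split(mapped$value, mapped$key)

# Reduce: apply an aggregation function to all values in each partition
counts <- sapply(partitions, sum)
print(counts)
```

In real Hadoop each phase runs on many machines and the data moves through disk and network between them; the logic per record and per partition stays the same.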


Example – Sessionize clickstream


Sessionize

Identify unique “sessions” of interacting with our website

Session – for each user (IP), set of clicks that happened within 30 minutes of each other


Input – Apache Access Log Records

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326


Output – Add Session ID

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15


Overview

[Diagram: log lines → Map → (IP1, log lines) → Reduce → (log line, session ID)]


Map

parsedRecord = re.search(‘(\\d+.\\d+….’,record)
IP = parsedRecord.group(1)
timestamp = parsedRecord.group(2)
print ((IP,timestamp),record)
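
The same parsing step can be sketched in R. The pattern below is an assumption made for this example (IP as the first field, timestamp between square brackets); it is not the regular expression elided on the slide.

```r
line <- '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Assumed pattern: IP is the first field, timestamp sits between [ and ]
pattern <- "^([^ ]+) [^ ]+ [^ ]+ \\[([^]]+)\\]"
m <- regmatches(line, regexec(pattern, line))[[1]]

ip <- m[2]   # first capture group
ts <- m[3]   # second capture group
print(c(ip, ts))
```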


Shuffle & Sort

Partition by: IP
Sort by: timestamp

Now reduce gets:
(IP,timestamp) [record1,record2,record3….]


Reduce

sessionID = 1
curr_record = records[0]
curr_timestamp = getTimestamp(curr_record)
foreach record in records:
    # records arrive sorted by timestamp; a gap of more than 30 minutes starts a new session
    if (getTimestamp(record) - curr_timestamp > 30):
        sessionID += 1
    curr_timestamp = getTimestamp(record)
    print(record + " " + sessionID)
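
A runnable R sketch of the reducer logic, under the assumption that a new session starts whenever the gap since the previous click exceeds 30 minutes. The function name `sessionize` and the sample click times are made up for illustration.

```r
# Assign session IDs to one user's click times.
# Assumption: a gap of more than `gap_minutes` since the previous click
# starts a new session.
sessionize <- function(click_times, gap_minutes = 30) {
  sorted <- sort(click_times)
  gaps <- c(0, diff(as.numeric(sorted)) / 60)  # minutes between consecutive clicks
  cumsum(gaps > gap_minutes) + 1               # bump the ID at every long gap
}

clicks <- as.POSIXct(c("2000-10-10 13:55:36", "2000-10-10 14:05:00",
                       "2000-10-10 15:00:00", "2000-10-10 15:10:00"),
                     tz = "UTC")
session_ids <- sessionize(clicks)
print(session_ids)
```

The vectorized `diff` and `cumsum` replace the explicit loop: every long gap adds one to the running session counter.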


Agenda

• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop


Reshape2

• Two functions:
  • Melt – wide format to long format
  • Cast – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
  • Unique identifiers
  • New column – variable name
  • New column – value
• Default – all numbers are values
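
To show what "molten" data looks like without requiring the package, here is a hand-rolled base-R sketch. It is not how reshape2 implements `melt`; the data frame `tips3` and the choice of `sex` as the only identifier are assumptions for the example.

```r
# A few rows in wide format (one row per bill)
tips3 <- data.frame(total_bill = c(16.99, 10.34, 21.01),
                    tip  = c(1.01, 1.66, 3.50),
                    sex  = c("Female", "Male", "Male"),
                    size = c(2, 3, 3),
                    stringsAsFactors = FALSE)

id_cols      <- "sex"                           # identifier columns
measure_cols <- c("total_bill", "tip", "size")  # measured (numeric) columns

# Long format: one row per (identifier, variable name, value)
molten <- do.call(rbind, lapply(measure_cols, function(col) {
  data.frame(tips3[id_cols], variable = col, value = tips3[[col]],
             stringsAsFactors = FALSE)
}))
print(molten)
```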


Melt

> tips
  total_bill  tip    sex smoker day   time size
       16.99 1.01 Female     No Sun Dinner    2
       10.34 1.66   Male     No Sun Dinner    3
       21.01 3.50   Male     No Sun Dinner    3

> melt(tips)
     sex smoker day   time   variable value
  Female     No Sun Dinner total_bill 16.99
  Female     No Sun Dinner        tip  1.01
  Female     No Sun Dinner       size     2


Cast

> m_tips <- melt(tips)
> m_tips
     sex smoker day   time   variable value
  Female     No Sun Dinner total_bill 16.99
  Female     No Sun Dinner        tip  1.01
  Female     No Sun Dinner       size     2

> dcast(m_tips,sex+time~variable,mean)
     sex   time total_bill      tip     size
  Female Dinner   19.21308 3.002115 2.461538
  Female  Lunch   16.33914 2.582857 2.457143
    Male Dinner   21.46145 3.144839 2.701613
    Male  Lunch   18.04848 2.882121 2.363636


*Apply

• apply – apply function on rows or columns of matrix
• lapply – apply function on each item of list
  • Returns list
• sapply – like lapply, but returns vector
• tapply – apply function to subsets of vector or lists
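
One small call per function, on invented data, as a sketch of how the four differ:

```r
m <- matrix(1:6, nrow = 2)            # 2 x 3 matrix: rows (1,3,5) and (2,4,6)
row_sums <- apply(m, 1, sum)          # apply over rows (MARGIN = 1)

lst <- list(a = 1:3, b = 4:6)
as_list <- lapply(lst, mean)          # returns a list
as_vec  <- sapply(lst, mean)          # same values, simplified to a vector

tip <- c(1, 2, 1, 2.5)
who <- c("Gwen", "Jeff", "Leon", "Gwen")
per_person <- tapply(tip, who, mean)  # mean tip within each group of `who`

print(row_sums)
print(as_vec)
print(per_person)
```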


plyr

• Split – apply – combine
• ddply – data frame to data frame
  ddply(.data, .variables, .fun = NULL, ...,
• Summarize – aggregate data into new data frame
• Transform – modify data frame
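
The split – apply – combine pattern itself can be sketched in base R. This illustrates the idea only, not plyr's implementation; the data frame `tips4` and its values are invented.

```r
tips4 <- data.frame(sex = c("Female", "Male", "Male", "Female"),
                    tip = c(1.01, 1.66, 3.50, 3.00),
                    stringsAsFactors = FALSE)

pieces <- split(tips4, tips4$sex)                # split: one data frame per group
summaries <- lapply(pieces, function(d) {        # apply: summarize each piece
  data.frame(sex = d$sex[1], mean_tip = mean(d$tip))
})
result <- do.call(rbind, summaries)              # combine: back into one data frame
print(result)
```

ddply wraps these three steps in a single call, which is why it reads as "data frame to data frame".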


DDPLY Example

> ddply(tips,c("sex","time"),summarize,
+   mean=mean(tip),
+   sd=sd(tip),
+   ratio=mean(tip/total_bill)
+ )
     sex   time     mean       sd     ratio
1 Female Dinner 3.002115 1.193483 0.1693216
2 Female  Lunch 2.582857 1.075108 0.1622849
3   Male Dinner 3.144839 1.529116 0.1554065
4   Male  Lunch 2.882121 1.329017 0.1660826


Agenda

• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop


Rhadoop Projects

• RMR
• RHDFS
• RHBase
• (new) PlyRMR


Most Important: RMR does not parallelize algorithms.

It allows you to implement MapReduce in R. Efficiently. That’s it.


What does that mean?

• Use RMR if you can break your problem down to small pieces and apply the algorithm there

• Use commercial R+Hadoop if you need a parallel version of a well-known algorithm

• Good fit: Fit piecewise regression model for each county in the US

• Bad fit: Fit piecewise regression model for the entire US population

• Bad fit: Logistic regression
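
The "good fit" case can be sketched in plain R: each group is fitted on its own, so each fit could run as a separate reduce call. The data is simulated, the two "counties" are hypothetical, and an ordinary linear model stands in for the piecewise regression named above.

```r
# Made-up data: two hypothetical "counties", each with its own linear trend
set.seed(1)
d <- data.frame(county = rep(c("A", "B"), each = 20), x = rep(1:20, 2))
d$y <- ifelse(d$county == "A", 2 * d$x, -1 * d$x) + rnorm(40, sd = 0.1)

# Each group is fitted independently of the others
fits <- lapply(split(d, d$county), function(g) coef(lm(y ~ x, data = g)))
slopes <- sapply(fits, function(cf) cf[["x"]])
print(round(slopes, 1))
```

Fitting one model over all rows at once does not decompose this way, which is what makes the whole-population case a bad fit.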


Use-case examples – Good or Bad?

1. Model power consumption per household to determine if incentive programs work

2. Aggregate corn yield per 10x10 portion of field to determine best seeds to use

3. Create churn models for service subscribers and determine who is most likely to cancel

4. Determine correlation between device restarts and support calls


Second Most Important: RMR requires R, RMR and all libraries you’ll use to be installed on all nodes and accessible by the Hadoop user


RMR is different from Hadoop Streaming.

RMR mapper input:
Key, [List of Records]

This is so we can use vector operations


How to RMRify a Problem


In more detail…

• Mappers get list of values
• You need to process each one independently
• But do it for all lines at once.

• Reducers work normally
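
A plain-R sketch of "each one independently, but all lines at once". The vector `v` stands in for the batch of lines a mapper would receive; no rmr2 call is made here, and the sample lines are invented.

```r
# Hypothetical batch of input lines, as a mapper would receive them in `v`
v <- c("Gwen 1", "Jeff 2", "Leon 1")

# Vectorized: every line is processed in one call, with no explicit loop
parts <- strsplit(v, " ")
keys  <- sapply(parts, `[`, 1)              # first field of every line
vals  <- as.numeric(sapply(parts, `[`, 2))  # second field of every line

print(keys)
print(vals)
```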


Demo 6

library(rmr2)
t <- list("hello world","don't worry be happy")
unlist(sapply(t,function (x) {strsplit(x," ")}))

# mapper: split every line into words, emit (word, 1)
wc.map <- function(k,v) {
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
  keyval(ret_k,1)
}

# reducer: sum the counts for each word
wc.reduce <- function(k,v) {
  keyval(k,sum(v))
}

mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
  output="~/wc.json",
  input.format="text",
  output.format="json",
  map=wc.map,
  reduce=wc.reduce);


Cheating in MapReduce: do everything possible to have map-only jobs


Avg Tips per Person – Naïve Input

Gwen 1
Jeff 2
Leon 1
Gwen 2.5
Leon 3
Jeff 1
Gwen 1
Gwen 2
Jeff 1.5


Avg Tips per Person - Naive

avg.map <- function(k,v){keyval(v$V1,v$V2)}

avg.reduce <- function(k,v) {keyval(k,mean(v))}

mapreduce(input="~/hadoop-recipes/data/tip1.txt",
  output="~/avg.txt",
  input.format=make.input.format("csv"),
  output.format="text",
  map=avg.map,
  reduce=avg.reduce);


Avg Tips per Person – Awesome Input

Gwen 1,2.5,1,2
Jeff 2,1,1.5
Leon 1,3


Avg Tips per Person - Optimized

avg2.map <- function(k,v) {
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
}

mapreduce(input="~/hadoop-recipes/data/tip2.txt",
  output="~/avg2.txt",
  input.format=make.input.format("csv",sep=","),
  output.format="text",
  map=avg2.map);
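
Why the pre-grouped input needs no reduce step can be checked locally in plain R, without rmr2. The sketch below uses the two sample inputs from the slides; the variable names are invented. In the naive layout the tips for one person are spread over many lines, so they must be grouped first; in the pre-grouped layout each line already holds everything needed for one average.

```r
# Naive layout: one (name, tip) pair per line
naive <- data.frame(
  name = c("Gwen", "Jeff", "Leon", "Gwen", "Leon", "Jeff", "Gwen", "Gwen", "Jeff"),
  tip  = c(1, 2, 1, 2.5, 3, 1, 1, 2, 1.5))
avg_naive <- tapply(naive$tip, naive$name, mean)   # needs grouping by name

# Pre-grouped layout: all tips for a person on one line
grouped <- c(Gwen = "1,2.5,1,2", Jeff = "2,1,1.5", Leon = "1,3")
avg_grouped <- sapply(strsplit(grouped, ","), function(x) mean(as.numeric(x)))

print(avg_naive)
print(avg_grouped)
```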


Few Final RMR Tips

• Backend = “local” has files as input and output
• Backend = “hadoop” uses HDFS directories
• In “hadoop” mode, print(X) inside the mapper will fail the job.
• Use: cat("ERROR!", file = stderr())
