DESCRIPTION
R + Hadoop presentation. From OSCON and Seattle meetup
1
Scalable Analytics with R, Hadoop and RHadoop
Gwen Shapira, Software Engineer
2
3
Goals:
• Teach R and Hadoop basics
• Share useful R libraries
• Explain RHadoop
• Give tips on how to use RHadoop
• Have fun, kick ass
4
Not a Goal:
Turn you into:
• Statistician
• Machine learning expert
• R guru
• Hadoop expert
5
#include warning.h
6
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop
7
Get Started with R-Studio
8
Basic Data Types
• String
• Number
• Boolean
• Assignment <-
9
R can be a nice calculator
> x <- 1
> x * 2
[1] 2
> y <- x + 3
> y
[1] 4
> log(y)
[1] 1.386294
> help(log)
10
Complex Data Types
• Vector
  • c, seq, rep, []
• List
• Data Frame
  • Lists of vectors of same length
  • Not a matrix
11
Creating vectors
> v1 <- c(1,2,3,4)
> v1
[1] 1 2 3 4
> v1 * 4
[1]  4  8 12 16
> v4 <- c(1:5)
> v4
[1] 1 2 3 4 5
> v2 <- seq(2,12,by=3)
> v2
[1]  2  5  8 11
> v1 * v2
[1]  2 10 24 44
> v3 <- rep(3,4)
> v3
[1] 3 3 3 3
12
Accessing and filtering vectors
> v1 <- c(2,4,6,8)
> v1
[1] 2 4 6 8
> v1[2]
[1] 4
> v1[2:4]
[1] 4 6 8
> v1[-2]
[1] 2 6 8
> v1[v1>3]
[1] 4 6 8
13
Lists
> lst <- list(1,"x",FALSE)
> lst
[[1]]
[1] 1

[[2]]
[1] "x"

[[3]]
[1] FALSE

> lst[1]
[[1]]
[1] 1

> lst[[1]]
[1] 1
14
Data Frames
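books <- read.csv("~/books.csv")
books[1,]                    # first row
books[,1]                    # first column
books[3:4]                   # columns 3 and 4
books$price                  # column by name
books[books$price==6.99,]    # rows matching a condition
martin_price <- books[books$author_t=="George R.R. Martin",]$price
mean(martin_price)
subset(books,select=-c(id,cat,sequence_i))   # drop columns by name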
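The books.csv file is not part of this deck, so the following is only a sketch on a small in-memory data frame. The column names `id`, `author_t` and `price` mirror the slide; every row value is invented for illustration.

```r
# Hypothetical stand-in for ~/books.csv; the rows are made up.
books <- data.frame(
  id       = 1:4,
  author_t = c("George R.R. Martin", "George R.R. Martin",
               "Isaac Asimov", "Rick Riordan"),
  price    = c(7.99, 6.99, 5.99, 6.99),
  stringsAsFactors = FALSE
)

first_row    <- books[1, ]                      # one row, all columns
prices       <- books$price                     # one column as a vector
cheap        <- books[books$price == 6.99, ]    # rows matching a condition
martin_price <- books[books$author_t == "George R.R. Martin", ]$price
martin_mean  <- mean(martin_price)              # average price for one author
no_id        <- subset(books, select = -c(id))  # drop a column by name

print(martin_mean)
```

Row filters go before the comma and column selections after it, which is why `books[1,]` and `books[,1]` return different things.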
15
Vectorization: always prefer operations on entire vectors
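As a small illustration of that advice, the sketch below computes the same result twice, once with an explicit loop and once with a single vector operation. The `prices` vector and the 1.1 multiplier are made-up values.

```r
prices <- c(7.99, 6.99, 5.99, 6.99)

# Loop version: handles one element at a time.
with_tax_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_tax_loop[i] <- prices[i] * 1.1
}

# Vectorized version: one operation over the whole vector.
with_tax_vec <- prices * 1.1

print(all.equal(with_tax_loop, with_tax_vec))
```

The vectorized form is shorter and usually faster in R, because the loop runs inside the interpreter's compiled code rather than in R itself.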
16
Functions
> sq <- function(x) { x*x }
> sq(3)
[1] 9
Note: R is a functional programming language. Functions are first-class objects and can be passed to other functions.
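A short sketch of what "passed to other functions" looks like in practice. `sq` is the function from the slide; `apply_twice` is a hypothetical helper introduced only for this example.

```r
sq <- function(x) { x * x }

# A function that takes another function as its first argument.
apply_twice <- function(f, x) { f(f(x)) }

squares <- sapply(c(1, 2, 3), sq)   # sq is passed to sapply
fourth  <- apply_twice(sq, 3)       # (3^2)^2

print(squares)
print(fourth)
```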
17
Packages
18
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox”
- Grace Hopper, early advocate of distributed computing
20
Hadoop in a Nutshell
21
Map-Reduce is the interesting bit
• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output and sort each partition
• Reduce – Apply aggregation function to all values in each partition
• Map reads input from disk
• Reduce writes output to disk
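The three phases can be imitated on a single machine in base R. This is only a local sketch of the idea, not Hadoop: nothing here is distributed or read from disk, and the `records` data is invented.

```r
# Input records: one click per element, identified by IP (made-up data).
records <- c("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1")

# Map: apply a function to each record, emitting a (key, value) pair.
mapped <- lapply(records, function(r) list(key = r, value = 1))

# Shuffle & sort: group the values by key; split() also orders groups by key.
keys     <- sapply(mapped, function(kv) kv$key)
values   <- sapply(mapped, function(kv) kv$value)
shuffled <- split(values, keys)

# Reduce: apply an aggregation function to all values of each key.
reduced <- sapply(shuffled, sum)
print(reduced)
```

Each key ends up with the number of clicks recorded for that IP, which is the same shape of computation a real job would spread across mappers and reducers.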
22
Example – Sessionize clickstream
23
Sessionize
Identify unique “sessions” of interaction with our website
Session – for each user (IP), the set of clicks that happened within 30 minutes of each other
24
Input – Apache Access Log Records
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
25
Output – Add Session ID
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15
26
Overview
[Diagram: log lines -> Map (x3) -> IP1, log lines -> Reduce (x2) -> log line, session ID]
27
Map
parsedRecord = re.search('(\\d+.\\d+….', record)
IP = parsedRecord.group(1)
timestamp = parsedRecord.group(2)
print((IP, timestamp), record)
28
Shuffle & Sort
Partition by: IP
Sort by: timestamp
Now reduce gets:
(IP, timestamp) [record1, record2, record3….]
29
Reduce
sessionID = 1
curr_record = records[0]
curr_timestamp = getTimestamp(curr_record)
foreach record in records:
    # records arrive sorted by timestamp; a gap of more than 30 minutes starts a new session
    if (getTimestamp(record) - curr_timestamp > 30):
        sessionID += 1
    curr_timestamp = getTimestamp(record)
    print(record + " " + sessionID)
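The reducer logic can also be written in vectorized R. The sketch below assumes all timestamps belong to a single IP and that a gap of more than 30 minutes opens a new session; `sessionize` is a hypothetical helper name and the click times are made up.

```r
# Assign session IDs to the clicks of one user (one IP).
sessionize <- function(timestamps, gap_minutes = 30) {
  ts   <- sort(timestamps)
  # Minutes elapsed since the previous click; the first click has no gap.
  gaps <- c(0, diff(as.numeric(ts)) / 60)
  # Every gap above the threshold bumps the session counter by one.
  cumsum(gaps > gap_minutes) + 1
}

clicks <- as.POSIXct(c("2000-10-10 13:55:36", "2000-10-10 14:05:00",
                       "2000-10-10 15:00:00", "2000-10-10 15:10:00"),
                     tz = "UTC")
sessions <- sessionize(clicks)
print(sessions)
```

Using `cumsum` over a logical vector replaces the explicit loop and counter, which fits the vector-at-a-time style R favors.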
30
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop
31
Reshape2
• Two functions:
  • Melt – wide format to long format
  • Cast – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
  • Unique identifiers
  • New column – variable name
  • New column – value
• Default – all numbers are values
32
Melt
> tips
 total_bill  tip    sex smoker day   time size
      16.99 1.01 Female     No Sun Dinner    2
      10.34 1.66   Male     No Sun Dinner    3
      21.01 3.50   Male     No Sun Dinner    3

> melt(tips)
    sex smoker day   time   variable value
 Female     No Sun Dinner total_bill 16.99
 Female     No Sun Dinner        tip  1.01
 Female     No Sun Dinner       size  2
33
Cast
> m_tips <- melt(tips)
> m_tips
    sex smoker day   time   variable value
 Female     No Sun Dinner total_bill 16.99
 Female     No Sun Dinner        tip  1.01
 Female     No Sun Dinner       size  2

> dcast(m_tips,sex+time~variable,mean)
    sex   time total_bill      tip     size
 Female Dinner   19.21308 3.002115 2.461538
 Female  Lunch   16.33914 2.582857 2.457143
   Male Dinner   21.46145 3.144839 2.701613
   Male  Lunch   18.04848 2.882121 2.363636
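To see what melting and casting do without installing reshape2, here is a rough base R imitation on the three rows shown on the Melt slide. It uses `stack()` and `tapply()` as stand-ins; it is not how `melt` or `dcast` are implemented, and the `measured` vector is an assumption made for this example.

```r
tips <- data.frame(
  sex        = c("Female", "Male", "Male"),
  time       = c("Dinner", "Dinner", "Dinner"),
  total_bill = c(16.99, 10.34, 21.01),
  tip        = c(1.01, 1.66, 3.50),
  stringsAsFactors = FALSE
)

measured <- c("total_bill", "tip")

# "Melt": repeat the identifier columns once per measured variable,
# then stack the measured columns into (value, variable) pairs.
ids  <- tips[rep(seq_len(nrow(tips)), times = length(measured)), c("sex", "time")]
long <- cbind(ids, stack(tips[measured]))
names(long) <- c("sex", "time", "value", "variable")

# "Cast": aggregate back to one row per sex and one column per variable.
wide <- tapply(long$value, list(long$sex, long$variable), mean)
print(wide)
```

The long table has one row per identifier and measured variable, which is the shape that makes the later aggregation a single grouped call.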
34
*Apply
• apply – apply function on rows or columns of a matrix
• lapply – apply function on each item of a list
  • Returns list
• sapply – like lapply, but returns a vector
• tapply – apply function to subsets of a vector, grouped by a factor
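One call per member of the family, as a quick sketch in base R. All input values are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)            # 2 x 3 matrix, filled column by column
row_sums <- apply(m, 1, sum)          # margin 1 = rows
col_sums <- apply(m, 2, sum)          # margin 2 = columns

words   <- list("hello", "hadoop", "r")
len_lst <- lapply(words, nchar)       # returns a list
len_vec <- sapply(words, nchar)       # returns a vector

tip <- c(1.01, 1.66, 3.50)
sex <- c("Female", "Male", "Male")
mean_tip <- tapply(tip, sex, mean)    # mean of tip within each value of sex

print(row_sums)
print(len_vec)
print(mean_tip)
```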
35
plyr
• Split – apply – combine
• ddply – data frame to data frame
  ddply(.data, .variables, .fun = NULL, ...,
• Summarize – aggregate data into new data frame
• Transform – modify data frame
36
DDPLY Example
> ddply(tips,c("sex","time"),summarize,
+   mean=mean(tip),
+   sd=sd(tip),
+   ratio=mean(tip/total_bill)
+ )
     sex   time     mean       sd     ratio
1 Female Dinner 3.002115 1.193483 0.1693216
2 Female  Lunch 2.582857 1.075108 0.1622849
3   Male Dinner 3.144839 1.529116 0.1554065
4   Male  Lunch 2.882121 1.329017 0.1660826
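The split, apply, combine pattern behind ddply can be spelled out by hand in base R. This is a sketch of the pattern only, not plyr's implementation, and the four rows are made-up values shaped like the tips data.

```r
tips <- data.frame(
  sex        = c("Female", "Male", "Male", "Female"),
  total_bill = c(16.99, 10.34, 21.01, 24.59),
  tip        = c(1.01, 1.66, 3.50, 3.61),
  stringsAsFactors = FALSE
)

# Split: one data frame per value of sex.
pieces <- split(tips, tips$sex)

# Apply: summarize each piece into a one-row data frame.
summaries <- lapply(pieces, function(d) {
  data.frame(sex   = d$sex[1],
             mean  = mean(d$tip),
             ratio = mean(d$tip / d$total_bill))
})

# Combine: bind the per-group rows back into a single data frame.
result <- do.call(rbind, summaries)
print(result)
```

ddply wraps these three steps in one call, which is why the slide's version fits on a few lines.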
37
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop
38
RHadoop Projects
• RMR
• RHDFS
• RHBase
• (new) PlyRMR
39
Most Important: RMR does not parallelize algorithms.
It allows you to implement MapReduce in R. Efficiently. That’s it.
40
What does that mean?
• Use RMR if you can break your problem down to small pieces and apply the algorithm there
• Use commercial R+Hadoop if you need a parallel version of a well-known algorithm
• Good fit: Fit piecewise regression model for each county in the US
• Bad fit: Fit piecewise regression model for the entire US population
• Bad fit: Logistic regression
41
Use-case examples – Good or Bad?
1. Model power consumption per household to determine if incentive programs work
2. Aggregate corn yield per 10x10 portion of field to determine best seeds to use
3. Create churn models for service subscribers and determine who is most likely to cancel
4. Determine correlation between device restarts and support calls
42
Second Most Important: RMR requires R, RMR and all libraries you’ll use to be installed on all nodes and accessible by the Hadoop user
43
RMR is different from Hadoop Streaming.
RMR mapper input: Key, [List of Records]
This is so we can use vector operations
44
How to RMRify a Problem
45
In more detail…
• Mappers get list of values
• You need to process each one independently
• But do it for all lines at once.
• Reducers work normally
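A local base R sketch of "process each one independently, but all at once". The vector `v` stands in for the batch of lines a mapper would receive; it is made-up data, and `table()` here is only a local stand-in for the reduce step, not an rmr2 call.

```r
# A batch of input lines, as a mapper would receive them in v.
v <- c("hello world", "don't worry be happy", "hello hadoop")

# Per-record style: loop over the lines one by one.
words_loop <- character(0)
for (line in v) {
  words_loop <- c(words_loop, strsplit(line, " ")[[1]])
}

# Vectorized style: strsplit() accepts the whole vector at once.
words_vec <- unlist(strsplit(v, " "))

# Local stand-in for the reduce step: count occurrences of each word.
counts <- table(words_vec)
print(as.integer(counts["hello"]))
```

Both styles produce the same words in the same order; the vectorized one avoids growing a vector inside a loop.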
46
Demo 6
library(rmr2)

# Try the splitting logic locally first
t <- list("hello world","don't worry be happy")
unlist(sapply(t,function (x) {strsplit(x," ")}))

# Mapper: emit (word, 1) for every word in the batch of lines
wc.map <- function(k,v) {
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
  keyval(ret_k,1)
}

# Reducer: sum the counts for each word
wc.reduce <- function(k,v) {
  keyval(k,sum(v))
}

mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
          output="~/wc.json",
          input.format="text",
          output.format="json",
          map=wc.map,
          reduce=wc.reduce)
47
Cheating in MapReduce: do everything possible to have map-only jobs
48
Avg Tips per Person – Naïve Input
Gwen 1
Jeff 2
Leon 1
Gwen 2.5
Leon 3
Jeff 1
Gwen 1
Gwen 2
Jeff 1.5
49
Avg Tips per Person - Naive
avg.map <- function(k,v) {
  keyval(v$V1,v$V2)
}

avg.reduce <- function(k,v) {
  keyval(k,mean(v))
}

mapreduce(input="~/hadoop-recipes/data/tip1.txt",
          output="~/avg.txt",
          input.format=make.input.format("csv"),
          output.format="text",
          map=avg.map,
          reduce=avg.reduce)
50
Avg Tips per Person – Awesome Input
Gwen 1,2.5,1,2
Jeff 2,1,1.5
Leon 1,3
51
Avg Tips per Person - Optimized
avg2.map <- function(k,v) {
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
}

mapreduce(input="~/hadoop-recipes/data/tip2.txt",
          output="~/avg2.txt",
          input.format=make.input.format("csv",sep=","),
          output.format="text",
          map=avg2.map)
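The difference between the two layouts can be checked locally in base R, using the values from the two input slides. This sketch does not call rmr2; `tapply` plays the role of the shuffle and reduce in the naive version, and the parsing of the second layout (space between name and tips, commas between tips) follows the input slide as printed, which is an assumption about the real file.

```r
# Naive layout: one (person, tip) pair per line, so averaging needs a
# group-by step, which is what the reducer provides.
naive <- data.frame(
  V1 = c("Gwen", "Jeff", "Leon", "Gwen", "Leon", "Jeff", "Gwen", "Gwen", "Jeff"),
  V2 = c(1, 2, 1, 2.5, 3, 1, 1, 2, 1.5),
  stringsAsFactors = FALSE
)
avg_naive <- tapply(naive$V2, naive$V1, mean)

# "Awesome" layout: each line already holds all tips for one person,
# so a map step alone can compute the average.
awesome <- c("Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3")
parts   <- strsplit(awesome, " ")
person  <- sapply(parts, function(p) p[1])
avg_map_only <- sapply(parts, function(p) {
  mean(as.numeric(strsplit(p[2], ",")[[1]]))
})
names(avg_map_only) <- person

print(avg_naive)
print(avg_map_only)
```

Because every line is self-contained in the second layout, no data has to move between nodes, which is the whole point of a map-only job.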
52
Few Final RMR Tips
• Backend = "local" has files as input and output
• Backend = "hadoop" uses HDFS directories
• In "hadoop" mode, print(X) inside the mapper will fail the job.
• Use: cat("ERROR!", file = stderr())
53
Recommended Reading
• http://cran.r-project.org/doc/manuals/R-intro.html
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
• http://had.co.nz/reshape/paper-dsc2005.pdf
• http://seananderson.ca/2013/12/01/plyr.html
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
• http://cran.r-project.org/web/packages/data.table/index.html
54