RHive : Integrating R and Hive Introduction JunHo Cho Data Analysis Platform Team Friday, November 11, 11

Integrate Hive and R

Page 1: Integrate Hive and R

RHive : Integrating R and Hive

Introduction

JunHo Cho

Data Analysis Platform Team

Friday, November 11, 11

Page 2: Integrate Hive and R

Analysis of Data

Page 3: Integrate Hive and R

Analysis of Data

MapReduce

Clustering

Classifier

CF

Decision Tree

Graph

Recommendation

Page 4: Integrate Hive and R

Related Works

• RHIPE

• RHadoop

• hive (Hadoop InteractiVE)

• segue

Must understand MapReduce

Page 5: Integrate Hive and R

RHive is inspired by ...

• Many analysts have used R for a long time

• Many analysts can use SQL

• There are already a lot of statistical functions in R

• R needs the capability to analyze big data

• Hive supports a SQL-like query language (HQL)

• Hive uses MapReduce to execute HQL

R is the best solution for familiarity

Hive is the best solution for capability

Page 6: Integrate Hive and R

RHive Components

• Hadoop

• store and analyze big data

• Hive

• use HQL instead of MapReduce programming

• R

• offer a familiar environment to analysts

Page 7: Integrate Hive and R

RHive - Architecture

[Diagram labels: RServe; R Function; R Object; RUDF; RUDAF]

Execute R Function Objects and R Objects through Hive Query

Execute Hive Query through R

rcal <- function(arg1,arg2) { coeff * sum(arg1,arg2) }

SELECT R('rcal',col1,col2) FROM tab1
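The round trip on this slide can be tried without a cluster. The following is a minimal local sketch, assuming a small in-memory data frame `tab1` and a value for `coeff` (both invented here for illustration). `mapply` stands in for Hive calling the R function once per row; it does not reproduce how RHive dispatches calls through Rserve.

```r
# Hypothetical stand-ins: 'coeff' and the contents of 'tab1' are invented.
coeff <- 2

# The R function from the slide. 'coeff' is a free variable, so the R object
# has to be available wherever the function is evaluated.
rcal <- function(arg1, arg2) {
  coeff * sum(arg1, arg2)
}

tab1 <- data.frame(col1 = c(1, 2, 3), col2 = c(10, 20, 30))

# Local analogue of: SELECT R('rcal', col1, col2) FROM tab1
# mapply calls rcal once per row, as a per-row UDF would.
result <- mapply(rcal, tab1$col1, tab1$col2)
print(result)
```

The point of the sketch is the dependency on `coeff`: both the function object and the R objects it refers to have to reach the nodes that evaluate the query.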

Page 8: Integrate Hive and R

RHive API

• Extension R Functions

• rhive.connect

• rhive.query

• rhive.assign

• rhive.export

• rhive.napply

• rhive.sapply

• rhive.aggregate

• rhive.list.tables

• rhive.load.table

• rhive.desc.table

• Extension Hive Functions

• RUDF

• RUDAF

• GenericUDTFExpand

• GenericUDTFUnFold

Page 9: Integrate Hive and R

RUDF - R User-defined Functions

• The UDF does not know the return type until the R function is called

• TYPE : a value that indicates the return type

SELECT R('R function-name',col1,col2,...,TYPE)

Example : R function which sums all passed columns

sumCols <- function(arg1,...) {
  sum(arg1,...)
}
rhive.assign('sumCols',sumCols)
rhive.exportAll('sumCols',hadoop-clusters)
result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
plot(result)
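Before exporting a function, it can help to check it locally and to look at the query text that will be sent. The sketch below assumes a hypothetical helper `rudf_query` (not part of RHive) that only assembles the string; the trailing `0.0` plays the role of the TYPE argument shown above.

```r
# Hypothetical helper (not part of RHive): builds the text of an R() UDF query.
# 'type_literal' is the trailing TYPE argument; "0.0" mirrors the slide.
rudf_query <- function(fun_name, cols, table, type_literal = "0.0") {
  args <- paste(c(sprintf("'%s'", fun_name), cols, type_literal), collapse = ", ")
  sprintf("SELECT R(%s) FROM %s", args, table)
}

sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}

# Check the function locally on one row's worth of values before exporting it.
stopifnot(sumCols(1, 2, 3, 4) == 10)

query <- rudf_query("sumCols", c("col1", "col2", "col3", "col4"), "tab")
print(query)
```

With a live connection, the resulting string is what would be handed to `rhive.query`.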

Page 10: Integrate Hive and R

RUDAF - R User-defined Aggregation Function

• R cannot manipulate large datasets

• Supports the UDAF life cycle

• iterate, partial-merge, merge, terminate

• The return type is always a string delimited by ',' - "data1,data2,data3,..."

SELECT RA('R function-name',col1,col2,...)

[Diagram: data blocks pass through FUN, FUN.partial, FUN.merge and FUN.terminate; intermediate results are labeled partial aggregation, partial aggregation, aggregation values]
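The life cycle above can be simulated in plain R. This is a minimal sketch of a mean aggregate split across three pretend tasks; the four function names follow the FUN, FUN.partial, FUN.merge, FUN.terminate pattern from the slide, but their signatures and the state layout are assumptions made for illustration, not RHive's documented contract.

```r
# Local simulation only. Signatures and state layout are assumptions.

# FUN (iterate): fold one row into the running state c(sum, count).
colmean <- function(state, value) {
  if (is.null(state)) state <- c(0, 0)
  c(state[1] + value, state[2] + 1)
}

# FUN.partial: hand over the state accumulated by one task.
colmean.partial <- function(state) state

# FUN.merge: combine two partial states.
colmean.merge <- function(a, b) a + b

# FUN.terminate: final result as a comma-delimited string "mean,count".
colmean.terminate <- function(state) {
  paste(state[1] / state[2], state[2], sep = ",")
}

# Three pretend input splits, each processed independently.
splits <- list(c(1, 2, 3), c(4, 5), c(6))
partials <- lapply(splits, function(rows) {
  state <- NULL
  for (v in rows) state <- colmean(state, v)
  colmean.partial(state)
})

merged <- Reduce(colmean.merge, partials)
result <- colmean.terminate(merged)
print(result)
```

Keeping the state as (sum, count) rather than a running mean is what makes the merge step a simple addition.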

Page 11: Integrate Hive and R

UDTF : unfold and expand

• RUDAF only returns a string delimited by ','

• Convert RUDAF's result to an R data.frame

RA('newcenter',...) returns "num1,num2,num3" per cluster-key

select unfold(tb1.x,0.0,0.0,0.0,',') as (col1,col2,col3) from (select RA('newcenter', attr1,attr2,attr3,attr4) as x from table group by cluster-key) tb1

unfold(string_value,type1,type2,...,delimiter)

expand(string_value,type,delimiter)
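What unfold achieves can be mimicked locally: each delimited string becomes one row with typed columns. The sketch below assumes a hypothetical helper `unfold_local` and invented sample strings; it illustrates the conversion only and is not the GenericUDTFUnFold implementation.

```r
# Invented sample of RUDAF output: one "num1,num2,num3" string per cluster key.
ra_result <- c("1.0,2.5,3.0", "4.0,5.5,6.0")

# Hypothetical local analogue of unfold(x, 0.0, 0.0, 0.0, ','):
# split each string on the delimiter and produce one numeric column per field.
unfold_local <- function(x, delimiter = ",", col_names = NULL) {
  parts <- strsplit(x, delimiter, fixed = TRUE)
  mat <- do.call(rbind, lapply(parts, as.numeric))
  df <- as.data.frame(mat)
  if (!is.null(col_names)) names(df) <- col_names
  df
}

centers <- unfold_local(ra_result, col_names = c("col1", "col2", "col3"))
print(centers)
```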

Page 12: Integrate Hive and R

napply and sapply

• napply : R apply function for Numeric type

• sapply : R apply function for String type

rhive.napply(table-name,FUN,col1,...)

rhive.sapply(table-name,FUN,col1,...)

Example : R function which sums all passed columns

sumCols <- function(arg1,...) {
  sum(arg1,...)
}
result <- rhive.napply("tab", sumCols, "col1", "col2", "col3", "col4")
rhive.load.table(result)

Page 13: Integrate Hive and R

napply

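rhive.napply <- function(tablename, FUN, col = NULL, ...) {
  # Build the column list that follows the function name in R(...)
  if (is.null(col)) cols <- "" else cols <- paste(",", col)
  for (element in c(...)) {
    cols <- paste(cols, ",", element)
  }
  # Unique name under which FUN is exported to the cluster
  exportname <- paste(tablename, "_sapply", as.integer(Sys.time()), sep = "")
  rhive.assign(exportname, FUN)
  rhive.exportAll(exportname)
  # sep = "" keeps the temporary table name free of spaces
  tmptable <- paste(exportname, "_table", sep = "")
  # The trailing 0.0 is the TYPE argument: a numeric return value
  rhive.query(paste("CREATE TABLE ", tmptable, " AS SELECT ", "R('", exportname, "'", cols, ",0.0) FROM ", tablename, sep = ""))
  tmptable
}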

• 'napply' is similar to the R apply function

• Stores a large result in HDFS as a Hive table
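Because the result lands in a Hive table, the names built with `paste()` matter. The short sketch below, which runs without Hive, shows how the export name and temporary table name come out; the value `"tab"` is a placeholder, and the only behaviour demonstrated is that of base R's `paste()`.

```r
tablename <- "tab"  # placeholder table name

# Export name: table name + suffix + epoch seconds, as in the napply listing.
exportname <- paste(tablename, "_sapply", as.integer(Sys.time()), sep = "")

# paste() separates its arguments with a space unless sep = "" is given.
with_space <- paste(exportname, "_table")
no_space <- paste(exportname, "_table", sep = "")  # equivalent to paste0()

print(with_space)
print(no_space)
```

A table name containing a space would not be a usable Hive identifier, which is why the separator should be set explicitly when assembling names.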

Page 14: Integrate Hive and R

aggregate

• RHive aggregation function: aggregates data stored in HDFS using a Hive function

rhive.aggregate(table-name,hive-FUN,...,groups)

Example : Aggregate using SUM (Hive aggregation function)

result <- rhive.aggregate("emp", "SUM", "sal", groups="deptno")
rhive.load.table(result)
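For comparison, the same grouping can be written in base R on a small in-memory table. The `emp` rows below are invented for illustration, and the commented HQL is only the query this grouping roughly corresponds to, not necessarily the text RHive generates.

```r
# Invented sample of an 'emp' table; values are for illustration only.
emp <- data.frame(
  deptno = c(10, 10, 20, 20, 30),
  sal    = c(1000, 1500, 800, 1200, 950)
)

# Roughly corresponds to: SELECT deptno, SUM(sal) FROM emp GROUP BY deptno
local_result <- aggregate(sal ~ deptno, data = emp, FUN = sum)
print(local_result)
```

The difference in the RHive version is where the work happens: the grouping runs in Hive over data in HDFS, and only the table name comes back to R.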

Page 15: Integrate Hive and R

Examples - predict flight delay

library(RHive)

rhive.connect()

- Retrieve training set from large dataset stored in HDFS

train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")

train$arrdelay <- as.numeric(train$arrdelay)

train$distance <- as.numeric(train$distance)

train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]

model <- lm(arrdelay ~ distance + dayofweek,data=train)

- Export R object data

rhive.assign("model", model)

- Analyze big data using model calculated by R

predict_table <- rhive.napply("airlines", function(arg1,arg2,arg3) {
  if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
  res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))
  return(as.numeric(res))
}, 'dayofweek', 'arrdelay', 'distance')

Native R code

HiveQuery + R code
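The native R half of this example can be rehearsed without a cluster. The sketch below replaces the sampled `airlines` rows with synthetic data (generated here, not real flight records) and uses `mapply` in place of `rhive.napply`, so it shows only the shape of the workflow: fit on a sample, then score row by row.

```r
set.seed(1)

# Synthetic stand-in for the sampled 'airlines' rows; not real flight data.
n <- 200
train <- data.frame(
  dayofweek = sample(1:7, n, replace = TRUE),
  distance  = runif(n, 100, 2500)
)
train$arrdelay <- 5 + 0.01 * train$distance + rnorm(n, sd = 10)

model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Same shape as the function passed to rhive.napply: one prediction per row.
predict_row <- function(arg1, arg2, arg3) {
  if (is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
  res <- predict(model, data.frame(dayofweek = arg1, arrdelay = arg2, distance = arg3))
  as.numeric(res)
}

# mapply plays the role of Hive calling the function for each row.
new_rows <- head(train, 3)
preds <- mapply(predict_row, new_rows$dayofweek, new_rows$arrdelay, new_rows$distance)
print(preds)
```

Scoring one row at a time is slow in local R, where `predict(model, new_rows)` is vectorized; the per-row form is shown because it matches how a per-row UDF is invoked.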

Page 16: Integrate Hive and R

DEMO

Page 17: Integrate Hive and R

Conclusion

• RHive supports HQL, not MapReduce model style

• RHive allows analysts to do everything in the R console

• RHive lets R data and HDFS data interact

• Future & Current Works

• Integrate Hadoop HDFS

• Support Transform/Map-Reduce Scripts

• Distributed Rserve

• Support a more R-style API

• Support machine learning algorithms (k-means, classifier, ...)

Page 18: Integrate Hive and R

Cooperators

• JunHo Cho

• Seonghak Hong

• Choonghyun Ryu

YOU!

Page 19: Integrate Hive and R

How to join the RHive project

• Logo

• github (https://github.com/nexr/RHive)

• CRAN (http://cran.r-project.org/web/packages/RHive)

• You are welcome to join the RHive project

Page 20: Integrate Hive and R

References

• Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)

• RHIPE (http://ml.stat.purdue.edu/rhipe)

• Hive (http://hive.apache.org)

• Parallel R by Q. Ethan McCallum and Stephen Weston

Page 22: Integrate Hive and R

Appendix

Page 23: Integrate Hive and R

RHIPE

• the R and Hadoop Integrated Processing Environment

• Must understand the MapReduce model

[Diagram labels: R; HDFS; Mapper (R); Reducer (R); PersonalServer; ProtocolBuf; RHMR; Fork; shuffle / sort; R Objects (map, reduce); R Conf; R Objects (map); R Objects (reduce)]

map <- function() {...}
reduce <- function() {...}
rmr <- rhmr(map,reduce,...)
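To make "must understand the MapReduce model" concrete, the three phases can be simulated in plain R. This is a local word-count sketch that does not use RHIPE or Hadoop; the `map` and `reduce` signatures and the input lines are assumptions chosen for illustration.

```r
# Local simulation only; no RHIPE or Hadoop calls are involved.

# map: emit one (key, value) pair per word in a line.
map <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1))
}

# reduce: combine all values that share a key.
reduce <- function(key, values) {
  list(key = key, value = sum(values))
}

input <- c("r hive r", "hive hadoop")

# Map phase: flatten the per-line pair lists into one list of pairs.
pairs <- unlist(lapply(input, map), recursive = FALSE)

# Shuffle / sort: group the values by key.
keys <- vapply(pairs, function(p) p$key, character(1))
vals <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(vals, keys)

# Reduce phase: one call per distinct key.
counts <- vapply(names(grouped), function(k) reduce(k, grouped[[k]])$value, numeric(1))
print(counts)
```

An analyst using the packages in this appendix has to express every analysis in this map / shuffle / reduce shape, which is the burden the HQL-based approach is meant to remove.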

Page 24: Integrate Hive and R

RHadoop

• Manipulate Hadoop data stores and HBASE directly from R

• Write MapReduce models in R using Hadoop Streaming

• Must understand the MapReduce model

[Diagram labels: R; HDFS; Mapper (R); Reducer (R); rmr; shuffle / sort; manipulate; Hadoop Streaming; rhdfs; HBASE; rhbase; store R objects as file; execute hadoop streaming]

map <- function() {...}
reduce <- function() {...}
mapreduce(input,output,map,reduce,...)

Page 25: Integrate Hive and R

hive (Hadoop InteractiVE)

• R extension facilitating distributed computing via the MapReduce paradigm

• Provide an interface to Hadoop, HDFS and Hadoop Streaming

• Must understand the MapReduce model

[Diagram labels: R; HDFS; Mapper (R); Reducer (R); hive_stream; shuffle / sort; manipulate; Hadoop Streaming; DFS; save R script on local; execute hadoop streaming; hive]

map <- function() {...}
reduce <- function() {...}
hive_stream(map,reduce,...)

Page 26: Integrate Hive and R

segue

• Simple parallel processing with a fast and easy setup on Amazon Web Services (AWS).

• Parallel lapply function for the EMR engine using Hadoop streaming.

• Supports only the Map model, not the full MapReduce model.

[Diagram labels: Amazon S3; R; emrlapply; save R objects (data + FUN) on local; awsFunctions; bootstrap (setup R); mapper.R; EMR; upload R objects; Mapper (R); Hadoop Streaming]

data <- list(...)
emrlapply(clusterObject,data,FUN,...)

Page 27: Integrate Hive and R

Ricardo

• Integrate R and Jaql (JSON Query Language)

• Must know Jaql, an uncommon query language

• Not open-source

Ref : Ricardo-SIGMOD10
