
RHive tutorial - advanced functions

RHive supports basic functions, HDFS functions, "apply"-like functions for map/reduce, and advanced functions that support UDF, UDAF, and UDTF. The advanced functions are the UDF/UDAF functions used to implement Hive's UDFs and UDAFs with R.

If you use these functions you can apply R at a lower level and perform Map/Reduce programming that uses Hadoop and Hive. You can also use R to handle a large portion of the complex algorithms and data-processing procedures found in data mining.

RHive - UDF and UDAF functions

Hive provides UDFs (User Defined Functions), UDAFs (User Defined Aggregate Functions), and UDTFs (User Defined Table-generating Functions). A UDF is a function such as count, avg, min, or max that Hive's SQL syntax uses to perform calculations on one or more columns. Hive ships with several built-in UDFs and allows users to add more, but UDFs, UDAFs, and UDTFs must normally all be written in Java and cannot be written directly in R. With RHive you can write UDFs and UDAFs in R and make them usable in Hive, and you can use the UDTF that RHive provides to split a complex column into appropriate columns.

This tutorial explains how to implement these UDFs and UDAFs through examples using RHive.

rhive.assign

The rhive.assign function assigns functions and variables made in R so that they can be referenced from Hive. Such assignment is required because we need to select which variables and functions will be used in the distributed environment. One thing to be aware of: rhive.assign only performs the preparation needed before deploying functions, variables, and objects to the distributed environment; it does not deploy them itself. It is normally used together with rhive.export or rhive.exportAll, which perform the actual deployment. Deployment here means, as in Hadoop Map/Reduce, placing the functions and objects to be used in Hive on standby within the job nodes, to be loaded whenever they are needed for processing data.

The rhive.assign function takes two arguments: the first is a character alias for the symbol created in R, and the second is the symbol itself. If the symbol to be deployed into the distributed environment is named "newsum", use rhive.assign to assign it like this:

newsum <- function(value) {
  value + 1
}

rhive.assign("newsum", newsum)

The syntax may seem a bit odd, but this is due to structural constraints in R and will be improved in the future. The first argument simply needs to be the name of the second-argument symbol as a string. Using a string identical to the symbol's name usually makes things easier, but it is not mandatory. However, if you choose a different name, keep in mind that the assigned symbol will be used under that new name; to avoid confusion it is best not to change it.

Look at the following example to figure out how to use rhive.assign.

The following example makes a function, sum3values, and uses rhive.assign to prepare it for deployment to the distributed environment.

sum3values <- function(a, b, c) {
  a + b + c
}

rhive.assign("sum3values", sum3values)
[1] TRUE

You can also assign objects that are not Functions.

coef1 <- 3.141593

rhive.assign("coef1", coef1)

Anything, including data frames, is viable as well.

library(MASS)
> head(cats)
  Sex Bwt Hwt
1   F 2.0 7.0
2   F 2.0 7.4
3   F 2.0 9.5
4   F 2.1 7.2
5   F 2.1 7.3
6   F 2.1 7.6
> rhive.assign("cats", cats)
[1] TRUE

Objects and data frames can also be used with rhive.assign. But once these objects are deployed to the distributed environment, they exist there as local copies in each node's memory rather than as shared memory, so be careful.

rhive.export

rhive.export actually deploys objects made in R. It is used together with rhive.assign, and the objects whose names it receives as arguments must already have been made ready for deployment via rhive.assign.

You can easily learn how it is used from the following example.

sum3values <- function(a, b, c) {
  a + b + c
}

rhive.assign("sum3values", sum3values)

rhive.export(sum3values)

Aside from the first argument, exportname, rhive.export has several other arguments. These specify servers and ports separately so that the function can be used in a complex setup; under normal circumstances they are not required. If your environment is complex enough to warrant them, consult the manual or contact the RHive development team.

rhive.exportAll

rhive.exportAll is functionally similar to rhive.export; the difference is that rhive.exportAll deploys all symbols whose names start with the string given as its first argument. It mainly serves to deploy UDAF functions written in R, which come as a set of four functions sharing the same name prefix. rhive.exportAll is a convenience built on rhive.export for deploying such a set at once.

The following example shows the making of 4 Functions and deploying them to a distributed environment.

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

The last line is effectively the same as the following:

# rhive.exportAll("sumAllColumns")

rhive.export("sumAllColumns")
rhive.export("sumAllColumns.partial")
rhive.export("sumAllColumns.merge")
rhive.export("sumAllColumns.terminate")

rhive.assign, rhive.export, and rhive.exportAll only go as far as preparing functions and objects and sending them to the Rserve instances in the distributed environment. To actually use these R functions and objects, you need to reference them in the SQL that you pass to rhive.query.

In the following examples, you will learn how to run the deployed Functions.

RHive - UDF usage

Now you must prepare a table to apply UDFs to. If there is no suitable table in Hive, create one. This tutorial converts the USArrests dataset, a small dataset that R includes by default, into a Hive table.

Convert USArrests into a Hive table and save it as shown below.

rhive.write.table(USArrests)

rhive.query("SELECT * FROM USArrests LIMIT 10")
       rowname murder assault urbanpop rape
1      Alabama   13.2     236       58 21.2
2       Alaska   10.0     263       48 44.5
3      Arizona    8.1     294       80 31.0
4     Arkansas    8.8     190       50 19.5
5   California    9.0     276       91 40.6
6     Colorado    7.9     204       78 38.7
7  Connecticut    3.3     110       77 11.1
8     Delaware    5.9     238       72 15.8
9      Florida   15.4     335       80 31.9
10     Georgia   17.4     211       60 25.8

The next example counts the total number of records in the USArrests table.

rhive.query("SELECT COUNT(*) FROM USArrests")
  X_c0
1   50

The COUNT function used in the example above is standard SQL and is also a Hive UDF; users familiar with SQL will know it as one of the most commonly used SQL functions. Now you will see how to write and run a function in R that works like such a built-in function.

First look at USArrests table’s description like below.

rhive.desc.table("USArrests")  

   col_name  data_type  comment  

1    rowname        string  

2      murder        double  

3    assault              int  

4  urbanpop              int  

5          rape        double  

Now we will make and execute a function that sums the values of the murder, assault, and rape columns for each row.

The following is the entire code for that.

library(RHive)
rhive.connect()

sumCrimes <- function(column1, column2, column3) {
  column1 + column2 + column3
}

rhive.assign("sumCrimes", sumCrimes)
rhive.export("sumCrimes")

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests")

rhive.close()

The results are as follows.

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) AS crimes FROM usarrests")
         rowname urbanpop crimes
1        Alabama       58  270.4
2         Alaska       48  317.5
3        Arizona       80  333.1
...
48 West Virginia       39   96.0
49     Wisconsin       66   66.4
50       Wyoming       60  183.4

The important thing about the example above is the function named R() inside the SQL. This is a Hive UDF, not an R function; more precisely, it is a function that RHive adds to Hive so that Hive can execute R functions. R() is used like UDFs such as sum, avg, or min: it calls the R function named by its first argument, receives the returned values, and hands them over to Hive. To explain the R() call used in the SQL below:

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests

'sumCrimes' is the name of the R function deployed by rhive.export, and the murder, assault, and rape that follow it are column names of the usarrests Hive table. The last argument passed to R(), 0.0, indicates the type of value the R() function will return: pass 0.0 if sumCrimes returns a numeric value, and "" if it returns a character value. For example, if the R function sumCrimes returned a character value, you would write:

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, \"\") FROM usarrests")

In the end, the query below returns a result composed of 3 columns, as seen in the result above.

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests

Note that R() returns only one value per row in this example; as the query results above show, R() ends up creating a single new column value.
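As an aside, the per-row arithmetic that R('sumCrimes', ...) performs can be checked locally in plain R using only the USArrests data bundled with R; this sketch is merely an illustration and does not involve Hive or RHive at all.

```r
# Local sketch of the per-row computation, no Hive needed.
# USArrests is a dataset that ships with base R.
sumCrimes <- function(column1, column2, column3) {
  column1 + column2 + column3
}

crimes <- sumCrimes(USArrests$Murder, USArrests$Assault, USArrests$Rape)
head(data.frame(rowname  = rownames(USArrests),
                urbanpop = USArrests$UrbanPop,
                crimes   = crimes), 3)
#   rowname urbanpop crimes
# 1 Alabama       58  270.4
# 2  Alaska       48  317.5
# 3 Arizona       80  333.1
```

The values match the first rows of the query result shown above.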

In fact, the SQL used above can be expressed in plain Hive SQL. The two queries below produce the same results.

RHive UDF SQL

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) AS crimes FROM usarrests")

Hive SQL

rhive.query("SELECT rowname, urbanpop, murder + assault + rape AS crimes FROM usarrests")

This tutorial deliberately uses a trivial example for the sake of easy learning; Hive itself already supports many UDFs and much column arithmetic.

If you are using RHive to process massive data and Hive SQL already supports a solution, it is recommended to use that solution. The following URL contains the relevant details.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

If Hive does not support a feature you need, and you must perform complex calculations over multiple columns, data mining, or machine learning, then RHive's UDF feature is very useful for such tasks.

RHive - UDAF 1

Similar to the UDF, there is the UDAF. A UDAF supports aggregation: it processes data grouped by column values in the GROUP BY clause of a SQL query. Like UDFs, new UDAFs normally require writing a Java module and adding it to Hive, but RHive lets you write them in R.

Although it was not explained at the time, you have already seen the four UDAF functions whose names start with sumAllColumns; they add up the values of the arguments passed to them. You may find yourself asking several questions.

The first may be, "Why does a UDAF need four functions at once?" and the second, "Where can a UDAF be used?"

To understand this best, take a look at runnable code.

First, to make a table suitable for applying the UDAF functions, take the iris data that R's datasets package contains by default and upload it to Hive.

rhive.write.table(iris)

[1] "iris"

> rhive.list.tables()
       tab_name
1         aids2
2          iris
3 new_usarrests
4     usarrests

rhive.desc.table("iris")
     col_name data_type comment
1     rowname    string
2 sepallength    double
3  sepalwidth    double
4 petallength    double
5  petalwidth    double
6     species    string

rhive.query("SELECT * FROM iris LIMIT 10")
   rowname sepallength sepalwidth petallength petalwidth species
1        1         5.1        3.5         1.4        0.2  setosa
2        2         4.9        3.0         1.4        0.2  setosa
3        3         4.7        3.2         1.3        0.2  setosa
4        4         4.6        3.1         1.5        0.2  setosa
5        5         5.0        3.6         1.4        0.2  setosa
6        6         5.4        3.9         1.7        0.4  setosa
7        7         4.6        3.4         1.4        0.3  setosa
8        8         5.0        3.4         1.5        0.2  setosa
9        9         4.4        2.9         1.4        0.2  setosa
10      10         4.9        3.1         1.5        0.1  setosa

You can gain a general view of how iris data is composed.

Now we will group the records that share the same value in the species column and compute the sum of each column within each group. The complete code, and the result of running it, looks like this:

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

result <- rhive.query("
  SELECT species, RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth)
  FROM iris
  GROUP BY species")

print(result)
     species
1     setosa
2 versicolor
3  virginica
                                                                        X_c1
1 250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2                           296.8,138.50000000000003,212.99999999999997,66.3
3              329.3999999999999,148.7,277.59999999999997,101.29999999999998

If you take a look at the last printed results from the run results, you will see 3 newly created records each with 2 columns.

The first thing to note is the RA() function in the SQL sent via rhive.query. It is similar to the R() function.

RA() returns only one value per group, and it is always of character type; Hive processes the returned results and finally sends them back to RHive. One thing to be aware of: RA() is a UDAF, so you must use it together with SQL's GROUP BY clause. See the Hive documentation for further details.

To explain the SQL syntax above in detail:

SELECT species, RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth)
  FROM iris
  GROUP BY species

It performs a separate calculation for each group of rows sharing the same species value, as determined by GROUP BY.

The first argument, 'sumAllColumns', is the common prefix of the four UDAF functions. The remaining arguments are the columns to be processed, and RA() returns one value per group computed from all of them.
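For comparison, the same GROUP BY aggregation can be reproduced locally in plain R with aggregate(), using the iris data bundled with R; note that locally the columns keep their original names (Sepal.Length and so on) rather than the lowercased names used in the Hive table. This is only an illustration, not part of RHive.

```r
# Plain-R counterpart of: SELECT species, <sum of each column> FROM iris GROUP BY species
sums <- aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
                  ~ Species, data = iris, FUN = sum)
print(sums)
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        250.3       171.4         73.1        12.3
# 2 versicolor        296.8       138.5        213.0        66.3
# 3  virginica        329.4       148.7        277.6       101.3
```

These are the same group sums that appear, comma-joined, in the X_c1 column above.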

By now you probably still have unanswered questions, and perhaps new ones as well; they are addressed in the next section.

RHive - UDAF 2

Now we will talk about the four UDAF R functions used in the previous examples, and their limitations.

A total of four functions were made for the UDAF:

• sumAllColumns
• sumAllColumns.partial
• sumAllColumns.merge
• sumAllColumns.terminate

UDAF functions for RHive must share a common prefix: three of the functions end with .partial, .merge, and .terminate, and one has no suffix at all. This naming rule must be followed in RHive.

Four functions are required because there are four points in the Map/Reduce procedure at which they run while a Hive UDAF executes. Suppose you made four UDAF functions beginning with the name "foobar". Each function runs at the following point:

• foobar – where the map-side aggregation is done (the combine step, to be precise)
• foobar.partial – where the aggregated result is sent to reduce
• foobar.merge – where the reduce-side aggregation is done
• foobar.terminate – where the reduce is terminated

foobar and foobar.merge do similar things and each take two arguments; foobar.partial and foobar.terminate are likewise similar and each take only one argument.

You might still wonder how these functions work together; to understand that, you need to grasp the flow between them.

• foobar - combine aggregation (map)
• foobar.partial - combine return (map)
• foobar.merge - reduce aggregation
• foobar.terminate - reduce terminate

The two combine steps may be skipped. Unless you are exact about your configuration, it is difficult to predict when these functions will be skipped or run; this is why implementing all four functions is a must. For more advanced know-how and a complete understanding, consult the Hive and Hadoop documents.

If you would rather forgo a complete understanding and just use the feature, remember this: to use RHive's UDAF support, you need to write all four functions and follow the naming rules.

RHive - UDAF 3

This section covers the workings and arguments of the four UDAF functions. From the sumAllColumns functions above, take a look at sumAllColumns and sumAllColumns.merge. The two functions have the same code.

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

These functions take two arguments. The first is the value previously returned by sumAllColumns or sumAllColumns.merge; the second is the record values Hive hands over. Both arguments are in fact vectors (or lists), and because the record arrives as a vector, you need to remember the order of the columns as given in the SQL.

The prev passed as the first argument is the value returned by the previous step, so on the first call it is NULL. Inside the function you can therefore use is.null to check for this case:

if (is.null(prev)) {
  prev <- rep(0.0, length(values))
}

The value accumulated over all records is then passed as the argument to sumAllColumns.partial and sumAllColumns.terminate. These two functions' code is identical:

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.terminate <- function(values) {
  values
}

These two functions do not run recursively, so they receive only one argument. The result of sumAllColumns.terminate is sent to Hive and becomes one column value.

In the example above, pairs of functions happen to have identical code, but that is only because this is a simple tutorial; in practice the four functions' code may all differ depending on context. Even when their code is identical, all four functions must still exist.
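To see how the four stages fit together, here is a purely local simulation in plain R, assuming the simplest case of a single map task per group; it runs the sumAllColumns functions on one species of the iris data without Hive or RHive involved.

```r
# The four UDAF functions, as defined in the tutorial.
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.partial <- function(values) {
  values
}
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}

# Map/combine stage: fold sumAllColumns over every record of one group.
setosa <- iris[iris$Species == "setosa", 1:4]
acc <- NULL                       # first call receives NULL as prev
for (i in seq_len(nrow(setosa))) {
  acc <- sumAllColumns(acc, as.numeric(setosa[i, ]))
}
partial <- sumAllColumns.partial(acc)   # what map sends to reduce

# Reduce stage: merge the partial results (only one here), then terminate.
merged <- sumAllColumns.merge(NULL, partial)
result <- sumAllColumns.terminate(merged)
round(result, 1)   # 250.3 171.4 73.1 12.3
```

The terminated value matches the setosa row of the UDAF result shown earlier.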

RHive - UDTF

This section is devoted to streamlining the code written so far into something more graceful. It is quite awkward to work with a result that comes back as a single string value; the string needs to be split into its individual components, and the unfold function, a UDTF supported by RHive, is suited to this.

Below is the result of running the SQL from the previous example.

print(result)
     species
1     setosa
2 versicolor
3  virginica
                                                                        X_c1
1 250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2                           296.8,138.50000000000003,212.99999999999997,66.3
3              329.3999999999999,148.7,277.59999999999997,101.29999999999298

The second column, X_c1, is the value produced by the UDAF, and it is of character type, with the individual values separated by commas. To turn it back into a numeric vector, R functions such as strsplit must be used. That is fine for a small number of records, but it becomes a problem otherwise: the example above has only 3 records, while the same procedure applied to big tables may face millions. Hence the values returned by the UDAF should each be split into proper column values on the Hive side.
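For a handful of rows, the strsplit approach mentioned above would look like this in plain R; it is workable for three records, but not something you would want to run over millions of rows pulled back into the client.

```r
# Splitting one UDAF result string back into a numeric vector with strsplit.
x <- "250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995"
v <- as.numeric(strsplit(x, ",")[[1]])
round(v, 1)   # 250.3 171.4  73.1  12.3
```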

To do this you need a subquery and a UDTF; the code rewritten to use the UDTF is as follows:

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

result <- rhive.query(
  "SELECT unfold(dummytable.dummycolumn, 0.0, 0.0, 0.0, 0.0, ',')
          AS (sepallength, sepalwidth, petallength, petalwidth)
   FROM (
       SELECT RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth) AS dummycolumn
       FROM iris
       GROUP BY species
       ) dummytable")

print(result)
  sepallength sepalwidth petallength petalwidth
1       250.3      171.4        73.1       12.3
2       296.8      138.5       213.0       66.3
3       329.4      148.7       277.6      101.3

The SQL became a bit more complex than in earlier examples, but in the final result you can see that the UDAF return values have been split into separate columns by the "unfold" UDTF. unfold is a UDTF provided by RHive, so there is no separate R code to write. Note that a Hive UDTF must be used alone in the SELECT list; this is a Hive restriction that RHive cannot work around.

The examples so far perform only a single Map/Reduce. If you use RHive for a very complex task, you may need multiple Map/Reduce passes, as is common in plain Map/Reduce or Hadoop streaming implementations. When chaining such passes, you may need to create temporary tables to store intermediate results and delete them afterwards.