
RHive tutorial - HDFS functions



Hive uses Hadoop's distributed file system (HDFS) to process data. Thus, to use Hive and RHive effectively, you must be able to put, get, and remove big data on HDFS. RHive provides functions that correspond to what the "hadoop fs" command supports. Using these functions, a user can handle HDFS from within the R environment without the Hadoop CLI (command line interface) or the Hadoop HDFS library. If you are more comfortable with the hadoop CLI or the Hadoop library, it is fine to keep using them. But if you work in RStudio Server or are not used to working from a terminal, the RHive HDFS functions should prove an easy-to-use way for R users to handle HDFS.

Before Emulating this Example The rhive.hdfs.* functions work only after RHive has been successfully installed and library(RHive) and rhive.connect() have been successfully executed. Do not forget to do the following before emulating the example.

# Open R
library(RHive)
rhive.connect()

rhive.hdfs.connect In order to use the RHive functions for HDFS, a connection to HDFS must be established. If the Hadoop configuration for HDFS is set properly, this connection is made automatically when the rhive.connect function is executed, so there is normally no need to call it separately.

If you need to connect to a different HDFS then you can do it like this:

rhive.hdfs.connect("hdfs://10.1.1.1:9000")  

[1]  "Java-­‐Object{DFS[DFSClient[clientName=DFSClient_630489789,  ugi=root]]}"  


The connection will fail if you do not supply the exact hostname and port number of the HDFS service. Ask your system administrator if you do not have this information.
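
If you would rather report a failed connection than let it stop your script, you can wrap the call in R's tryCatch. This is a minimal sketch; "hdfs://10.1.1.1:9000" is just the placeholder address from above, so substitute your own namenode host and port.

# Sketch: attempt the HDFS connection and report failure instead of stopping.
# The URI is a placeholder; use your own namenode host and port.
tryCatch(
  rhive.hdfs.connect("hdfs://10.1.1.1:9000"),
  error = function(e) {
    message("HDFS connection failed: ", conditionMessage(e))
  }
)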

rhive.hdfs.ls This does the same thing as "hadoop fs -ls" and is used like this:

rhive.hdfs.ls("/")  

   permission  owner            group      length            modify-­‐time                file  

1    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  14:27        /airline  

2    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  13:16  /benchmarks  

3    rw-­‐r-­‐-­‐r-­‐-­‐    root  supergroup  11186419  2011-­‐12-­‐06  03:59      /messages  

4    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  22:05                /mnt  

5    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐13  20:24            /rhive  

6    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  20:19                /tmp  

7    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐14  01:14              /user  

This is the same as the following command using the Hadoop CLI:

hadoop fs -ls /
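
Because the listing is returned to R (as the printed output above suggests, an ordinary data frame), you can filter and sort it with normal R code. A small sketch, assuming the columns are named and typed as printed:

# Assuming rhive.hdfs.ls returns a data.frame with the columns shown above.
listing <- rhive.hdfs.ls("/")
# Keep only non-empty files and sort them by size, largest first.
files <- listing[listing$length > 0, ]
files[order(files$length, decreasing = TRUE), c("file", "length")]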

rhive.hdfs.get The rhive.hdfs.get function retrieves data from HDFS to the local file system. It works the same way as "hadoop fs -get". The next example takes the messages data in HDFS, saves it to the local system's /tmp/messages, and then checks the number of records.

rhive.hdfs.get("/messages",  "/tmp/messages")  

Page 3: RHive tutorial - HDFS functions

[1]  TRUE  

system("wc  -­‐l  /tmp/messages")  

145889  /tmp/messages  

rhive.hdfs.put The rhive.hdfs.put function uploads local data to HDFS. It works like "hadoop fs -put" and is the opposite of rhive.hdfs.get. The following example uploads "/tmp/messages" on the local system to "/messages_new" in HDFS.

rhive.hdfs.put("/tmp/messages",  "/messages_new")  

rhive.hdfs.ls("/")  

   permission  owner            group      length            modify-­‐time                    file  

1    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  14:27            /airline  

2    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  13:16      /benchmarks  

3    rw-­‐r-­‐-­‐r-­‐-­‐    root  supergroup  11186419  2011-­‐12-­‐06  03:59          /messages  

4    rw-­‐r-­‐-­‐r-­‐-­‐    root  supergroup  11186419  2011-­‐12-­‐14  02:02  /messages_new  

5    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  22:05                    /mnt  

6    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐13  20:24                /rhive  

7    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐14  01:14                  /user  

You can see a new file, "/messages_new", now appears in HDFS.
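
A common pattern is to write an R object to a local file first and then push that file to HDFS. Here is a sketch of that round trip; the HDFS target path "/iris_demo.csv" is hypothetical, chosen only for illustration:

# Hypothetical round trip: save a data frame locally, then upload it to HDFS.
local.file <- file.path(tempdir(), "iris.csv")
write.csv(iris, local.file, row.names = FALSE)
rhive.hdfs.put(local.file, "/iris_demo.csv")   # "/iris_demo.csv" is a made-up path
rhive.hdfs.ls("/")                             # the new file should appear here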

rhive.hdfs.rm This does the same thing as "hadoop fs -rm": it deletes files in HDFS.


rhive.hdfs.rm("/messages_new")  

rhive.hdfs.ls("/")  

   permission  owner            group      length            modify-­‐time                file  

1    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  14:27        /airline  

2    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  13:16  /benchmarks  

3    rw-­‐r-­‐-­‐r-­‐-­‐    root  supergroup  11186419  2011-­‐12-­‐06  03:59      /messages  

4    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  22:05                /mnt  

5    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐13  20:24            /rhive  

6    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐14  01:14              /user  

You can see the "/messages_new" file has been deleted from within HDFS.

rhive.hdfs.rename This does the same thing as "hadoop fs -mv". That is, it renames files in HDFS or moves files and directories.

rhive.hdfs.rename("/messages",  "/messages_renamed")  

[1]  TRUE  

rhive.hdfs.ls("/")  

   permission  owner            group      length            modify-­‐time                            file  

1    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  14:27                    /airline  

2    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  13:16              /benchmarks  

3    rw-­‐r-­‐-­‐r-­‐-­‐    root  supergroup  11186419  2011-­‐12-­‐06  03:59  /messages_renamed  

4    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  22:05                            /mnt  

Page 5: RHive tutorial - HDFS functions

5    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐13  20:24                        /rhive  

6    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐14  01:14                          /user  
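
Since the new name is just an R string, rename also lends itself to patterns like archiving files under a date suffix. A sketch that is illustrative only and not run as part of this tutorial (the later listings still show /messages_renamed):

# Illustrative only: archive a file by renaming it with today's date.
old.path <- "/messages_renamed"
new.path <- paste0(old.path, "_", format(Sys.Date(), "%Y%m%d"))
rhive.hdfs.rename(old.path, new.path)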

rhive.hdfs.exists This checks whether a file exists in HDFS. It has no direct counterpart among the hadoop fs commands.

rhive.hdfs.exists("/messages_renamed")  

[1]  TRUE  

rhive.hdfs.exists("/foobar")  

[1]  FALSE  
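
rhive.hdfs.exists pairs naturally with rhive.hdfs.rm for a guarded delete. A small sketch, with "/foobar" standing in for whatever path you want to remove:

# Guarded delete: only remove the path if it actually exists.
path <- "/foobar"   # stand-in path from the example above
if (rhive.hdfs.exists(path)) {
  rhive.hdfs.rm(path)
} else {
  message(path, " does not exist; nothing to remove")
}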

rhive.hdfs.mkdirs This does the same thing as "hadoop fs -mkdir": it creates directories in HDFS, including any intermediate subdirectories.

rhive.hdfs.mkdirs("/newdir/newsubdir")  

[1]  TRUE  

rhive.hdfs.ls("/")  

   permission  owner            group      length            modify-­‐time                            file  

1    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  14:27                    /airline  

2    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  13:16              /benchmarks  

3    rw-­‐r-­‐-­‐r-­‐-­‐    root  supergroup  11186419  2011-­‐12-­‐06  03:59  /messages_renamed  

4    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐07  22:05                            /mnt  

5    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐14  02:13                      /newdir  

6    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐13  

Page 6: RHive tutorial - HDFS functions

20:24                        /rhive  

7    rwxr-­‐xr-­‐x    root  supergroup                0  2011-­‐12-­‐14  01:14                          /user  

rhive.hdfs.ls("/newdir")  

   permission  owner            group  length            modify-­‐time                            file  

1    rwxr-­‐xr-­‐x    root  supergroup            0  2011-­‐12-­‐14  02:13  /newdir/newsubdir  
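
Because mkdirs creates intermediate directories, you can lay out a destination path first and then upload into it. A sketch using the /newdir/newsubdir path from the example and the local /tmp/messages copy fetched earlier:

# Create the directory hierarchy, then upload a local file into it.
rhive.hdfs.mkdirs("/newdir/newsubdir")
# "/tmp/messages" is the local copy fetched earlier in this tutorial.
rhive.hdfs.put("/tmp/messages", "/newdir/newsubdir/messages")
rhive.hdfs.ls("/newdir/newsubdir")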

rhive.hdfs.close This closes the connection when you have finished working with HDFS and no longer need it.

rhive.hdfs.close()
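
Putting the pieces together, a typical session might look like the following sketch. The "/messages_demo" path is illustrative; the tryCatch finally clause ensures the connection is closed even if a step fails.

# A minimal end-to-end session sketch.
library(RHive)
rhive.connect()   # also connects to HDFS when Hadoop is configured properly
tryCatch({
  rhive.hdfs.put("/tmp/messages", "/messages_demo")  # "/messages_demo" is made up
  print(rhive.hdfs.ls("/"))
  rhive.hdfs.rm("/messages_demo")
}, finally = {
  rhive.hdfs.close()   # always close the HDFS connection
})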