aiden-seonghak-hong
RHive tutorial - HDFS functions
Hive stores and processes its data on Hadoop's distributed file system (HDFS). To use Hive and RHive well, you therefore need to be able to put, get, and remove large data sets in HDFS. RHive provides functions that correspond to what the "hadoop fs" command supports. With these functions you can work with HDFS from within the R environment, without using the Hadoop CLI (command-line interface) or the Hadoop HDFS library. If you are more comfortable with the "hadoop" CLI or the Hadoop library, it is fine to use them instead. But if you work in RStudio Server or are not familiar with working from a terminal, the RHive HDFS functions should prove an easy way for R users to handle HDFS.
Before Running the Examples
The rhive.hdfs.* functions work only after RHive has been installed and both library(RHive) and rhive.connect() have run successfully. Do the following before trying the examples.
# Open R
library(RHive)
rhive.connect()
rhive.hdfs.connect
To use the RHive HDFS functions, a connection to HDFS must first be established. If the Hadoop configuration for HDFS is set correctly and rhive.connect has been executed, this function is called automatically, so there is no need to run it separately.
If you need to connect to a different HDFS then you can do it like this:
rhive.hdfs.connect("hdfs://10.1.1.1:9000")
[1] "Java-Object{DFS[DFSClient[clientName=DFSClient_630489789, ugi=root]]}"
The connection fails unless you supply the exact hostname and port number of the HDFS service. Ask your system administrator if you do not have this information.
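A failed connection can be easier to diagnose when the message names the address that was tried. The following is a minimal sketch, assuming that rhive.hdfs.connect signals an ordinary R error when it cannot connect (the tutorial does not say how failures are reported). The name connect_hdfs is a hypothetical helper, not part of RHive.

```r
# Hypothetical helper (not part of RHive): wrap rhive.hdfs.connect so that
# a failure reports the address that was tried.
# Assumption: rhive.hdfs.connect raises an R error when it cannot connect.
connect_hdfs <- function(uri) {
  tryCatch(
    rhive.hdfs.connect(uri),
    error = function(e) {
      stop("Could not connect to HDFS at ", uri, ": ", conditionMessage(e),
           call. = FALSE)
    }
  )
}

# Usage, with the address from the example above:
# connect_hdfs("hdfs://10.1.1.1:9000")
```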
rhive.hdfs.ls
This does the same thing as "hadoop fs -ls". It is used like this:
rhive.hdfs.ls("/")
   permission  owner  group         length  modify-time       file
1  rwxr-xr-x   root   supergroup         0  2011-12-07 14:27  /airline
2  rwxr-xr-x   root   supergroup         0  2011-12-07 13:16  /benchmarks
3  rw-r--r--   root   supergroup  11186419  2011-12-06 03:59  /messages
4  rwxr-xr-x   root   supergroup         0  2011-12-07 22:05  /mnt
5  rwxr-xr-x   root   supergroup         0  2011-12-13 20:24  /rhive
6  rwxr-xr-x   root   supergroup         0  2011-12-07 20:19  /tmp
7  rwxr-xr-x   root   supergroup         0  2011-12-14 01:14  /user
This is equivalent to the following Hadoop CLI command:
hadoop fs -ls /
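One advantage of listing from R is that the result can be filtered with ordinary R code. The sketch below assumes that rhive.hdfs.ls returns a data frame with a numeric "length" column, as the printed output suggests; the tutorial does not state the return type, so check it with str() first. The name entries_at_least is a hypothetical helper.

```r
# Hypothetical helper: keep only the listing rows of at least min_bytes.
# Assumption: the listing is a data frame with a numeric "length" column.
entries_at_least <- function(listing, min_bytes) {
  listing[listing$length >= min_bytes, , drop = FALSE]
}

# Usage: entries of 1 MiB or more in the HDFS root directory
# entries_at_least(rhive.hdfs.ls("/"), 1024 * 1024)
```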
rhive.hdfs.get
The rhive.hdfs.get function copies data from HDFS to the local file system. It works in the same way as "hadoop fs -get". The next example copies the messages data in HDFS to /tmp/messages on the local system, then checks the number of records.
rhive.hdfs.get("/messages", "/tmp/messages")
[1] TRUE
system("wc -l /tmp/messages")
145889 /tmp/messages
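The record count above shells out to wc, which is available only on Unix-like systems. As an alternative, the lines can be counted in base R. This is a minimal sketch; count_lines is a hypothetical helper, and it reads the file in chunks so that a large download does not have to fit in memory at once. Unlike "wc -l", which counts newline characters, it also counts a final line that lacks a trailing newline.

```r
# Hypothetical helper: count the lines of a local text file in chunks.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))          # always release the file connection
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break   # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Usage, after rhive.hdfs.get("/messages", "/tmp/messages"):
# count_lines("/tmp/messages")
```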
rhive.hdfs.put
The rhive.hdfs.put function uploads local data to HDFS. It works like "hadoop fs -put" and is the opposite of rhive.hdfs.get. The following example uploads "/tmp/messages" on the local system to "/messages_new" in HDFS.
rhive.hdfs.put("/tmp/messages", "/messages_new")
rhive.hdfs.ls("/")
   permission  owner  group         length  modify-time       file
1  rwxr-xr-x   root   supergroup         0  2011-12-07 14:27  /airline
2  rwxr-xr-x   root   supergroup         0  2011-12-07 13:16  /benchmarks
3  rw-r--r--   root   supergroup  11186419  2011-12-06 03:59  /messages
4  rw-r--r--   root   supergroup  11186419  2011-12-14 02:02  /messages_new
5  rwxr-xr-x   root   supergroup         0  2011-12-07 22:05  /mnt
6  rwxr-xr-x   root   supergroup         0  2011-12-13 20:24  /rhive
7  rwxr-xr-x   root   supergroup         0  2011-12-14 01:14  /user
You can see a new file, "/messages_new", now appears in HDFS.
rhive.hdfs.rm
This does the same thing as "hadoop fs -rm": it deletes files in HDFS.
rhive.hdfs.rm("/messages_new")
rhive.hdfs.ls("/")
   permission  owner  group         length  modify-time       file
1  rwxr-xr-x   root   supergroup         0  2011-12-07 14:27  /airline
2  rwxr-xr-x   root   supergroup         0  2011-12-07 13:16  /benchmarks
3  rw-r--r--   root   supergroup  11186419  2011-12-06 03:59  /messages
4  rwxr-xr-x   root   supergroup         0  2011-12-07 22:05  /mnt
5  rwxr-xr-x   root   supergroup         0  2011-12-13 20:24  /rhive
6  rwxr-xr-x   root   supergroup         0  2011-12-14 01:14  /user
You can see that the "/messages_new" file has been deleted from HDFS.
rhive.hdfs.rename
This does the same thing as "hadoop fs -mv": it renames files in HDFS or moves directories.
rhive.hdfs.rename("/messages", "/messages_renamed")
[1] TRUE
rhive.hdfs.ls("/")
   permission  owner  group         length  modify-time       file
1  rwxr-xr-x   root   supergroup         0  2011-12-07 14:27  /airline
2  rwxr-xr-x   root   supergroup         0  2011-12-07 13:16  /benchmarks
3  rw-r--r--   root   supergroup  11186419  2011-12-06 03:59  /messages_renamed
4  rwxr-xr-x   root   supergroup         0  2011-12-07 22:05  /mnt
5  rwxr-xr-x   root   supergroup         0  2011-12-13 20:24  /rhive
6  rwxr-xr-x   root   supergroup         0  2011-12-14 01:14  /user
rhive.hdfs.exists
This checks whether a file exists in HDFS. The closest Hadoop CLI counterpart is "hadoop fs -test -e", which reports the result through its exit status instead of printing it.
rhive.hdfs.exists("/messages_renamed")
[1] TRUE
rhive.hdfs.exists("/foobar")
[1] FALSE
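An existence check is useful as a guard before other operations. The sketch below uploads a file only when the target path is not yet present in HDFS. It assumes that rhive.hdfs.exists returns TRUE or FALSE as in the example above; the tutorial does not say what rhive.hdfs.put does when the target already exists, which is why the guard is explicit. The name put_if_absent is a hypothetical helper.

```r
# Hypothetical helper: upload only when the HDFS target does not exist yet.
# Returns TRUE (invisibly) if the file was uploaded, FALSE if it was skipped.
put_if_absent <- function(local_path, hdfs_path) {
  if (!file.exists(local_path)) {
    stop("Local file not found: ", local_path, call. = FALSE)
  }
  if (rhive.hdfs.exists(hdfs_path)) {
    message("Skipping upload, already in HDFS: ", hdfs_path)
    return(invisible(FALSE))
  }
  rhive.hdfs.put(local_path, hdfs_path)
  invisible(TRUE)
}

# Usage:
# put_if_absent("/tmp/messages", "/messages_new")
```

The check and the upload are two separate calls, so another client could still create the target in between; treat this as a convenience, not a lock.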
rhive.hdfs.mkdirs
This does the same thing as "hadoop fs -mkdir". It creates directories in HDFS, including any nested subdirectories.
rhive.hdfs.mkdirs("/newdir/newsubdir")
[1] TRUE
rhive.hdfs.ls("/")
   permission  owner  group         length  modify-time       file
1  rwxr-xr-x   root   supergroup         0  2011-12-07 14:27  /airline
2  rwxr-xr-x   root   supergroup         0  2011-12-07 13:16  /benchmarks
3  rw-r--r--   root   supergroup  11186419  2011-12-06 03:59  /messages_renamed
4  rwxr-xr-x   root   supergroup         0  2011-12-07 22:05  /mnt
5  rwxr-xr-x   root   supergroup         0  2011-12-14 02:13  /newdir
6  rwxr-xr-x   root   supergroup         0  2011-12-13 20:24  /rhive
7  rwxr-xr-x   root   supergroup         0  2011-12-14 01:14  /user
rhive.hdfs.ls("/newdir")
   permission  owner  group         length  modify-time       file
1  rwxr-xr-x   root   supergroup         0  2011-12-14 02:13  /newdir/newsubdir
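Creating a directory is often the first step of an upload. The following sketch combines rhive.hdfs.exists, rhive.hdfs.mkdirs, and rhive.hdfs.put to place a local file inside an HDFS directory, creating the directory first if needed. The name upload_into is a hypothetical helper, and the sketch assumes the three RHive functions behave as in the examples in this tutorial.

```r
# Hypothetical helper: upload a local file into an HDFS directory,
# creating the directory (and any parents) when it does not exist.
# Returns the HDFS path the file was written to.
upload_into <- function(local_path, hdfs_dir) {
  if (!rhive.hdfs.exists(hdfs_dir)) {
    rhive.hdfs.mkdirs(hdfs_dir)
  }
  # Drop trailing slashes so the joined path has exactly one separator.
  target <- paste0(sub("/+$", "", hdfs_dir), "/", basename(local_path))
  rhive.hdfs.put(local_path, target)
  target
}

# Usage:
# upload_into("/tmp/messages", "/newdir/newsubdir")
```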
rhive.hdfs.close
Use this to close the connection once you have finished working with HDFS.
rhive.hdfs.close()
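When a script connects to a specific HDFS, it is easy to leave the connection open if a step fails partway through. One way to avoid that is to register the close call with on.exit, so that it runs even when an error occurs. This is a minimal sketch; with_hdfs is a hypothetical helper, and task is any function of no arguments that performs the HDFS work.

```r
# Hypothetical helper: connect, run a task, and always close the connection.
with_hdfs <- function(uri, task) {
  rhive.hdfs.connect(uri)
  # Registered only after a successful connect; runs on normal return
  # and on error alike.
  on.exit(rhive.hdfs.close(), add = TRUE)
  task()
}

# Usage, with the address from the rhive.hdfs.connect example:
# with_hdfs("hdfs://10.1.1.1:9000", function() rhive.hdfs.ls("/"))
```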