
Aims

This exercise aims to get you to:

- Import data into HBase using bulk load
- Read MapReduce input from HBase and write MapReduce output to HBase
- Manage data using Hive
- Manage data using Pig

Background

In HBase-speak, bulk loading is the process of preparing and loading HFiles (HBase’s own file format) directly into the RegionServers. Bulk load steps:

1. Extract the data from a source, typically text files or another database.

2. Transform the data into HFiles. This step requires a MapReduce job, and for most input types you will have to write the Mapper yourself. The job needs to emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; you configure it using HFileOutputFormat2.configureIncrementalLoad().

3. Load the files into HBase by telling the RegionServers where to find them. This step uses LoadIncrementalHFiles (more commonly known as the completebulkload tool): given a URL that locates the files in HDFS, it loads each file into the relevant region via the RegionServer that serves it.

The data flow goes from the original source to HDFS, where the RegionServers simply move the files into their regions’ directories.

See more details at: http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
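Once the HFiles have been generated, the third step can also be run from the command line with the completebulkload tool. A minimal sketch, assuming the HFiles were written to /user/comp9313/hfiles and the target table is “votes” (adjust the path and table name to your own setup):

$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/comp9313/hfiles votes

In this lab, the same step is performed programmatically with LoadIncrementalHFiles inside the MapReduce driver, as described below.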


Because HBase is not installed in the VM image on the lab computers, you need to install HBase again following the instructions in Lab 5.

Create a project “Lab6” and create a package “comp9313.lab6” in this project. Put all your Java code in this package and keep a copy. Right click the project -> Properties -> Java Build Path -> Libraries -> Add External JARs -> go to the folder “comp9313/hbase-1.2.2/lib”, and add all the jar files to the project.

Data Set

Download the two files “Votes” and “Comments” from the course homepage. The data set contains many questions asked on http://www.stackexchange.com and the corresponding answers. The two files used in this week’s lab are obtained from https://archive.org/details/stackexchange, as part of “datascience.stackexchange.com.7z”. The format of the data set is described at: https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.

The data format of Votes is (the field BountyAmount is ignored):

- **votes**.xml
  - Id
  - PostId
  - VoteTypeId
    - ` 1`: AcceptedByOriginator
    - ` 2`: UpMod
    - ` 3`: DownMod
    - ` 4`: Offensive
    - ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated
    - ` 6`: Close
    - ` 7`: Reopen
    - ` 8`: BountyStart
    - ` 9`: BountyClose
    - `10`: Deletion
    - `11`: Undeletion
    - `12`: Spam
    - `13`: InformModerator
  - CreationDate
  - UserId (only for VoteTypeId 5)
  - BountyAmount (only for VoteTypeId 9)

For example:
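(The sample row shown in the original handout is not reproduced here. As a purely illustrative, made-up record in the Stack Exchange dump format, a line of Votes looks like:)

<row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />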

The data format of Comments is:


- **comments**.xml
  - Id
  - PostId
  - Score
  - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
  - CreationDate, e.g.: "2008-09-06T08:07:10.730"
  - UserId

For example:
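(Again, the sample row from the handout is not reproduced here. A purely illustrative, made-up line in the same dump format would look like:)

<row Id="9" PostId="5" Score="2" Text="Nice question!" CreationDate="2014-05-14T08:07:10.730" UserId="23" />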

HBase Data Bulk Load

Import “Votes” as a table in HBase.

1. HBase will use a “staging” folder to store temporary data, and we need to configure this directory for HBase. Create a folder /tmp/hbase-staging in HDFS, and change its mode to 711 (i.e., rwx--x--x).

$ hdfs dfs -mkdir /tmp/hbase-staging
$ hdfs dfs -chmod 711 /tmp/hbase-staging

Add the following lines to $HBASE_HOME/conf/hbase-site.xml (between <configuration> and </configuration>):

<property>
  <name>hbase.bulkload.staging.dir</name>
  <value>/tmp/hbase-staging</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
</property>

In your MapReduce code, you need to configure the two properties “hbase.fs.tmp.dir” and “hbase.bulkload.staging.dir”. After creating a Configuration object, you need to:

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.fs.tmp.dir", "/tmp/hbase-staging");
conf.set("hbase.bulkload.staging.dir", "/tmp/hbase-staging");


2. The code for bulk loading Votes into HBase is available on the course homepage, i.e., “Vote.java” and “HBaseBulkLoadExample.java”. Some explanations of the code:

Only the mapper is required in bulk load, because the Reducer is handled by HBase and you configure it using HFileOutputFormat2.configureIncrementalLoad(). The map output key data type must be ImmutableBytesWritable, and the map output value data type can only be a KeyValue, Put, or Delete object. In this example, you create a Put object, which will be used to insert the data into the HBase table.

The table can be created either using the HBase shell or the HBase Java API. In the given code, the table is created using the Java API.

In the example code, the class HBaseBulkLoadExample implements the interface Tool, and the job is configured and started in the run() function. Then, ToolRunner.run() is used to invoke HBaseBulkLoadExample.run(). You can also configure and start the job in the main function, as you did in the previous labs on MapReduce.

Before starting the job, you need to use HFileOutputFormat2.configureIncrementalLoad() to configure the bulk load. After the job is completed, that is, the mapper has generated the Put objects for all input data, you use LoadIncrementalHFiles to do the bulk load. It is the tool that loads the output of HFileOutputFormat2 into an existing table.
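The provided files contain the full implementation. The following is only a minimal sketch of the same overall structure, under the assumptions that the Votes file is read line by line as XML “row” elements and that a single hypothetical column family “voteInfo” is used; the real Vote.java and HBaseBulkLoadExample.java may organise the columns and the driver differently.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

    // Mapper: emit (row key, Put) pairs; the reducer is supplied by HBase.
    public static class VotesMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static String attr(String line, String name) {
            // Pull one attribute value out of an XML "row" element.
            Matcher m = Pattern.compile(name + "=\"([^\"]*)\"").matcher(line);
            return m.find() ? m.group(1) : null;
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String id = attr(line, "Id");
            if (id == null) return;                        // skip non-row lines
            byte[] rowKey = Bytes.toBytes(id);
            Put put = new Put(rowKey);
            // "voteInfo" is a hypothetical column family used only for this sketch.
            put.addColumn(Bytes.toBytes("voteInfo"), Bytes.toBytes("PostId"),
                    Bytes.toBytes(attr(line, "PostId")));
            put.addColumn(Bytes.toBytes("voteInfo"), Bytes.toBytes("VoteTypeId"),
                    Bytes.toBytes(attr(line, "VoteTypeId")));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.fs.tmp.dir", "/tmp/hbase-staging");
        conf.set("hbase.bulkload.staging.dir", "/tmp/hbase-staging");

        Job job = Job.getInstance(conf, "votes bulk load sketch");
        job.setJarByClass(BulkLoadSketch.class);
        job.setMapperClass(VotesMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // Votes file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // directory for the HFiles

        Connection conn = ConnectionFactory.createConnection(conf);
        Table table = conn.getTable(TableName.valueOf("votes"));             // table must already exist
        RegionLocator locator = conn.getRegionLocator(TableName.valueOf("votes"));

        // Configure the HBase-provided reducer and the HFile output format.
        HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

        if (job.waitForCompletion(true)) {
            // Move the generated HFiles into the regions of the table.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path(args[1]), conn.getAdmin(), table, locator);
        }
        conn.close();
    }
}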

3. After “Votes” is loaded into the table “votes”, open the HBase shell to check the table and its contents.
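For instance (these are standard HBase shell commands; the rows and columns you see will depend on your load):

$ hbase shell
hbase> describe 'votes'
hbase> scan 'votes', {LIMIT => 5}
hbase> count 'votes'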

Your Task: Import “Comments” as a table in HBase.

Create a class “HBaseBulkLoadComments.java” and a class “Comment.java” in package “comp9313.lab6” to finish this task.

Use “Id” as the rowkey, and create three column families: “postInfo” (containing PostId), “commentInfo” (containing Score, Text, and CreationDate), and “userInfo” (containing UserId).

Read MapReduce Input from HBase

Problem 1.

Read the input data from table “votes” in HBase, and for each post count the number of votes of each type. The output data is of the format: (PostID, {<VoteTypeId, count>}).


For example, if the post with ID “1” has two votes, one of type “1” and another of type “2”, then you should output (1, {<1, 1>, <2, 1>}).

Please refer to https://hbase.apache.org/book.html#mapreduce.example for examples of HBase MapReduce read.

Hints:

1. Your mapper should extend TableMapper<K, V>. The input key data type is ImmutableBytesWritable, and the value data type is Result. Each map() call reads one row from the HBase table, and you can use Result.getValue(CF, COLUMN) to get the value in a cell. Your mapper code will look like:

public static class AggregateMapper extends TableMapper<Text, Text> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        … // do your job
    }
}

2. The reducer is just like a normal MapReduce reducer.

3. In the main function, you will need to use the function TableMapReduceUtil.initTableMapperJob() to configure the mapper.

4. Because the data is read from HBase, you do not need to configure the data input path. You only need to specify the output path in Eclipse.

The code “ReadHBaseExample.java” is available on the course webpage. Try to write the mapper by yourself, and learn how to configure the HBase read job from that file.
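As a reference for the shape of that configuration only (this is not the provided ReadHBaseExample.java), a minimal driver sketch might look as follows, assuming the AggregateMapper from the hint above and a hypothetical reducer class AggregateReducer that you write yourself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadHBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "read votes from HBase");
        job.setJarByClass(ReadHBaseSketch.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch more rows per RPC for scan-heavy jobs
        scan.setCacheBlocks(false);  // recommended setting for MapReduce scans

        // Wire the HBase table "votes" in as the job input.
        TableMapReduceUtil.initTableMapperJob(
                "votes",                 // input table
                scan,                    // scan instance controlling what is read
                AggregateMapper.class,   // mapper from the hint above
                Text.class,              // mapper output key type
                Text.class,              // mapper output value type
                job);

        job.setReducerClass(AggregateReducer.class);   // a normal reducer you write yourself
        FileOutputFormat.setOutputPath(job, new Path(args[0]));  // only the output path is needed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}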

Problem 2:

Read the input data from table “comments” in HBase, and calculate the number of comments per UserId. Refer to the code “ReadHBaseExample.java” and write your code in “ReadHBaseComment.java” in package “comp9313.lab6”.

Write MapReduce Output to HBase

Problem 1.

Read the input data from “Votes”, and count the number of votes per user. The result will be written to an HBase table “votestats”, rather than stored in files generated by reducers.


Please refer to https://hbase.apache.org/book.html#mapreduce.example for examples of HBase MapReduce write.

Hints:

1. The mapper is just like a normal MapReduce mapper.

2. Your reducer should extend TableReducer<KEYIN, VALUEIN, KEYOUT>. The output key is ignored by HBase (ImmutableBytesWritable is used as the output key type here), and the value you write out must be a Put (or Delete) object. The reduce() function will aggregate the number of votes for a user. You need to create a Put object to store the information, and HBase will use this object to insert the information into table “votestats”. Your reducer code will look like:

public static class UserVotesReducer extends
        TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        … // do your job
    }
}

3. In the main function, you will need to use the function TableMapReduceUtil.initTableReducerJob() to configure the reducer.

4. You can create the table in the main function, or using the HBase shell.

5. Because the data is written to HBase, you do not need to configure the data output path. You only need to specify the input path in Eclipse.

The code “WriteHBaseExample.java” is available on the course webpage. Try to write the reducer by yourself, and learn how to configure the HBase write job from that file.
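Again only as a reference for the configuration pattern (not the provided WriteHBaseExample.java), a minimal driver sketch, assuming a hypothetical mapper UserVotesMapper that emits (Text userId, IntWritable one), the UserVotesReducer from the hint above, and an existing table “votestats”:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WriteHBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "write vote counts to HBase");
        job.setJarByClass(WriteHBaseSketch.class);

        job.setMapperClass(UserVotesMapper.class);      // a normal mapper you write yourself
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // only the input path is needed

        // Wire the HBase table "votestats" in as the job output.
        TableMapReduceUtil.initTableReducerJob(
                "votestats",             // output table (must exist)
                UserVotesReducer.class,  // reducer writing Put objects
                job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Inside reduce(), after summing the values into an int sum, the Put is built and written with a null key (the output key is ignored by the table output format); “stats” and “count” below are hypothetical family and qualifier names:

Put put = new Put(Bytes.toBytes(key.toString()));
put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("count"), Bytes.toBytes(String.valueOf(sum)));
context.write(null, put);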

Problem 2:

Read input data from “Comments”, and calculate the average score of comments for each question. The result will be written to an HBase table “post_comment_score”, with only one column family “stats”.

Refer to the code “WriteHBaseExample.java” and write your code in “WriteHBaseComment.java” in package “comp9313.lab6”.

Manage Data Using Hive

Hive Installation and Configuration

1. Download Hive 2.1.0


$ wget http://apache.uberglobalmirror.com/hive/stable-2/apache-hive-2.1.0-bin.tar.gz

Then unpack the package:

$ tar xvf apache-hive-2.1.0-bin.tar.gz

2. Define environment variables for Hive

We need to configure the working directory of Hive, i.e., HIVE_HOME. Open the file ~/.bashrc and add the following lines at the end of this file:

export HIVE_HOME=~/apache-hive-2.1.0-bin
export PATH=$HIVE_HOME/bin:$PATH

Save the file, and then run the following command for these settings to take effect:

$ source ~/.bashrc

3. Create /tmp and /user/hive/warehouse in HDFS and set their mode to g+w, so that more than one user can use them:

$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse

4. Run the schematool command to initialize Hive

$ schematool -dbType derby -initSchema

Now you have completed the basic configuration of Hive, and it is ready to use. Start the Hive shell with the following command (start HDFS and YARN first!):

$ hive


Practice Hive

1. Download the test file “employees.txt” from the course webpage. The file contains only 7 records. Put the file in the home folder.

2. Create a database

$ hive> create database employee_data;
$ hive> use employee_data;

3. All databases are created under the /user/hive/warehouse directory.

$ hdfs dfs -ls /user/hive/warehouse

4. Create the employee table

$ hive> CREATE TABLE employees (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Because '\001', '\002', '\003', and '\n' are the defaults, you can omit the “ROW FORMAT DELIMITED” clause. “STORED AS TEXTFILE” is also the default, and can be omitted as well.
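To make the delimiters concrete, a record in such a file has its top-level fields separated by ^A ('\001'), items inside an array or map separated by ^B ('\002'), and map keys separated from their values by ^C ('\003'). The following is only a made-up illustration of that layout, not an actual line from employees.txt:

Alice^A75000.0^ABob^BCarol^AFederal Taxes^C0.2^BState Taxes^C0.05^A123 George St^BSydney^BNSW^B2000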

5. Show all tables in the current database

$ hive> show tables;

6. Load data from the local file system into the table

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE employees;


After loading the data into the table, you can check in HDFS what happened:

$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees

The file employees.txt is copied into the folder corresponding to the table.

7. Check the data in the table

$ hive> select * from employees;

8. You can do various queries based on the employees table, just as in an RDBMS. For example:

Question 1: show the number of employees and their average salary.

Hint: use count() and avg().

Question 2: find the employee who has the highest salary.

Hint: use max(), an IN clause, and a subquery in the where clause.
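If you get stuck, one possible formulation of each query, following the hints (other formulations are equally valid), is:

$ hive> SELECT count(*), avg(salary) FROM employees;
$ hive> SELECT name, salary FROM employees
        WHERE salary IN (SELECT max(salary) FROM employees);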

9. Usage of explode(). Find all employees who are the subordinate of another person. explode() takes an array (or a map) as input and outputs the elements of the array (map) as separate rows.

$ hive> SELECT explode(subordinates) FROM employees;

10. Hive partitions. The table employees was not defined as partitioned, and thus you cannot add a partition to it. You can only add a new partition to a table that has already been partitioned!

Create a table employees2, and load the same file into it.

$ hive> CREATE TABLE employees2 (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
) PARTITIONED BY (join_year STRING);

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE employees2 PARTITION (join_year='2015');

Now check HDFS again to see what happened:

$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees2

You will see a folder “join_year=2015” created in this folder, corresponding to the partition join_year=“2015”.

Add a new partition join_year=“2016” to the table.

$ hive> ALTER TABLE employees2 ADD PARTITION (join_year='2016') LOCATION '/user/hive/warehouse/employee_data.db/employees2/join_year=2016';

Check in HDFS, and you will see a new folder created for this partition.

11. Insert a record into partition join_year=“2016”.

Because Hive does not support literals for complex types (array, map, struct, union), it is not possible to use them in INSERT INTO ... VALUES clauses. You need to create a file to store the new record, and then load it into the partition.

$ cp employees.txt employees2016.txt

Then use vim or gedit to edit employees2016.txt to add some records, and then load the file into the partition, following the same pattern as the earlier LOAD DATA statement (see the example below).
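For instance (assuming the edited file is at /home/comp9313/employees2016.txt; without OVERWRITE the new records are appended to the partition):

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees2016.txt' INTO TABLE employees2 PARTITION (join_year='2016');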

12. Query on a partition. Question: find all employees who joined in the year 2016 and whose salary is more than 60000.
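One possible query (a sketch; note that join_year is a string column):

$ hive> SELECT * FROM employees2 WHERE join_year = '2016' AND salary > 60000;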

13. (optional) Do word count in Hive, using the file employees.txt.

Manage Data Using Pig

Pig Installation and Configuration

1. Download Pig 0.16.0

$ wget http://mirror.ventraip.net.au/apache/pig/pig-0.16.0/pig-0.16.0.tar.gz

Then unpack the package:

$ tar xvf pig-0.16.0.tar.gz


2. Define environment variables for Pig

We need to configure the working directory of Pig, i.e., PIG_HOME. Open the file ~/.bashrc and add the following lines at the end of this file:

export PIG_HOME=~/pig-0.16.0
export PATH=$PIG_HOME/bin:$PATH

Save the file, and then run the following command for these settings to take effect:

$ source ~/.bashrc

3. Now you have completed the basic configuration of Pig, and it is ready to use. Start the Pig Grunt shell with the following command (start HDFS and YARN first!):

$ pig

Practice Pig

1. Download the test file “NYSE_dividends.txt” from the course webpage. The file contains 670 records. Put the file into HDFS.

$ hdfs dfs -put NYSE_dividends.txt

Start the Hadoop job history server.

$ mr-jobhistory-daemon.sh start historyserver

2. Load the data using the load command, with the schema (exchange, symbol, date, dividend).

$ grunt> dividends = load 'NYSE_dividends.txt' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
$ grunt> dump dividends;


You should see the records of the file printed as tuples, one per line.

3. Group rows by symbol.

$ grunt> grouped = group dividends by symbol;

4. Compute the average dividend for each symbol. The dividend value is obtained using the expression dividends.dividend (or dividends.$3). Store this result in a relation named avg.

$ grunt> avg = foreach grouped generate group, AVG(dividends.$3);

Use dump to check the contents of “avg”; you should see one (symbol, average dividend) tuple per symbol.

5. Store the result avg into HDFS using the store command

$ grunt> store avg into 'average_dividend';

6. Check the stored result in HDFS using the fs command

$ grunt> fs -cat /user/comp9313/average_dividend/*

7. (optional) Do word count in Pig, using the file employees.txt.

More Practices

More practice with Hive and Pig is included in the second assignment.