Hive + HCatalog


Big Data ETL with Hive and HCatalog, using the public StackOverflow dataset. Includes installation instructions.

Hive + HCatalog
By Amru Eliwat

CS 157B at San Jose State University
www.linkedin.com/in/amrue/

Agenda

• What is Hive?

• What is HCatalog?

- Using it with Hive

• Setting up Hive + HCat locally

• Setting up Hive + HCat in a Virtual Machine

• Demo

- Loading data into HCat manually

- Loading data into HCat using Hive

- Basic Hive queries

Total time: Approximately 30 minutes

Hive

• “Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.”

Hive

• Runs SQL-like queries using HiveQL, which are implicitly converted into map-reduce jobs.

• Because of HiveQL's declarative nature, Hive excels at ad hoc analysis.
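For example, prefixing a query with EXPLAIN shows the map-reduce stages Hive will generate; a sketch with a hypothetical posts table:

-- 'posts' and 'posttypeid' are placeholder names for illustration
EXPLAIN SELECT posttypeid, COUNT(*) FROM posts GROUP BY posttypeid;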

Hive

• Hive acts on metadata in the Hive metastore.

• Metadata is stored in an Apache Derby database by default, but a MySQL database can be used instead.

- When using the default Derby database, only one process can connect to the metastore at a time, so the default is suitable only for testing.

• Although queries act on tables defined in the metastore, the data itself lives in HDFS and the queries are converted into map-reduce jobs, so we do not get the efficiency and optimizations of an RDBMS.
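A sketch of pointing the metastore at MySQL instead of Derby via hive-site.xml (the host, database name, and credentials below are placeholders):

<!-- placeholder values; adjust for your MySQL server -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>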

Hive

• Peter Jamack at the IBM developerWorks blog asks:

Hive for ETL or ELT?

• You can extract, transform, then load your data with Hive, but Jamack suggests it is better to extract, load, then transform with Hive.

• Hive works better for some types of data than others.

“Obviously, choosing between adopting an ELT or ETL philosophy requires thought. This decision can account for more than 70 percent of the planning time required for many data warehouse, master data management, and other database projects.”

HCatalog

• “Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid.”

HCatalog

• HCatalog presents a relational view of data. Data is stored in tables, and these tables can be placed into databases.

• Hive can read data in HCatalog directly, because HCatalog is based on Hive's metastore. Other tools require interfaces, such as HCatLoader and HCatStorer for Pig.

• In other words, HCatalog can be seen as a project enabling non-Hive scripts to access Hive metastore tables.
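As a sketch of how a non-Hive script does this through the Pig interfaces mentioned above (the posts table name is a placeholder; in Hive 0.12 the loader class is packaged under org.apache.hive.hcatalog.pig):

-- Pig Latin: read a Hive/HCatalog table without specifying file paths or schemas
A = LOAD 'posts' USING org.apache.hive.hcatalog.pig.HCatLoader();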

HCatalog

• As mentioned earlier, we do not get the efficiency of an RDBMS despite the data being presented in a relational view.

Setup

Installing Hive requires some work; however, it comes with HCatalog out of the box, which shares Hive's metastore.

1. Download and unpack the tarball (.tar.gz).

2. Set the environment variable HIVE_HOME to point to the installation directory:

export HIVE_HOME="/usr/local/hive-0.12.0"

3. Add $HIVE_HOME/bin to your PATH:

export PATH=$HIVE_HOME/bin:$PATH

4. You will need Hadoop installed to continue. Create the following folders for Hive's metastore, then make them group-writable (chmod g+w) in HDFS:

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Setup

5. Copy the file hive-default.xml.template in $HIVE_HOME/conf and rename the copy hive-site.xml.

6. Start the HCatalog server by running $HIVE_HOME/hcatalog/sbin/hcat_server.sh start in the terminal.

7. Create a table in Hive using CREATE TABLE.
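A minimal sketch of such a table for the demo data (the column list and tab delimiter are assumptions about the StackOverflow extract):

-- columns are illustrative; match them to your actual file
CREATE TABLE posts (id INT, posttypeid INT, title STRING, body STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';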

8. Load data:

LOAD DATA LOCAL INPATH './files/stackoverflow.txt' OVERWRITE INTO TABLE posts;

Hive + HCatalog

• If you’ve made it this far, you can use your knowledge of SQL to run HiveQL queries.

SELECT column_name FROM table_name;

SELECT a.* FROM a JOIN b ON (a.id = b.id);

“Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.”

http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
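For example, where standard SQL would use an IN or EXISTS subquery, Hive uses LEFT SEMI JOIN; a sketch reusing the placeholder tables above:

-- returns the rows of a that have at least one match in b
SELECT a.* FROM a LEFT SEMI JOIN b ON (a.id = b.id);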

Setup #2

• Download the correct version of HortonWorks Sandbox for your virtual machine setup from the HortonWorks webpage.

• Double-click the HortonWorks Sandbox virtual machine launch file (it’s the only file in the folder you just downloaded).

• Point your browser to 127.0.0.1:8888

Demo

Loading data into HCat and analyzing it with Hive

Demo

• Fire up HortonWorks in VirtualBox

• Click on “HCatalog” on the upper toolbar

• On the left-hand side, choose “Create New Table Manually” and give it a name. Then:

• Finally, select a file to upload.

• Alternatively, use Hive for the whole process:
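A sketch of that Hive-only alternative, mirroring the earlier setup steps (the table name, columns, and file path are assumptions consistent with the query below):

-- define the table, then load the raw file in one step
CREATE TABLE post_etl (id INT, posttypeid INT, title STRING, body STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH './files/stackoverflow.txt' OVERWRITE INTO TABLE post_etl;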

• Once the data is loaded into HCat, we can take a closer look at the data.

SELECT * FROM post_etl WHERE posttypeid = 1;
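A follow-up sketch in the same vein (again assuming a posttypeid column), counting posts by type:

SELECT posttypeid, COUNT(*) FROM post_etl GROUP BY posttypeid;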

Q&A
