Hive + HCatalog


Big Data ETL with Hive and HCatalog, using the public StackOverflow dataset. Includes installation instructions.

Hive + HCatalog
By Amru Eliwat

CS 157B at San Jose State University
www.linkedin.com/in/amrue/

Agenda

• What is Hive?

• What is HCatalog?

- Using it with Hive

• Setting up Hive + HCat locally

• Setting up Hive + HCat in a Virtual Machine

• Demo

- Loading data into HCat manually

- Loading data into HCat using Hive

- Basic Hive queries

Total time: Approximately 30 minutes

Hive

• “Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.”

Hive

• Runs SQL-like queries using HiveQL, which are implicitly converted into map-reduce jobs.

• Because of HiveQL's declarative nature, Hive excels at ad hoc analysis.
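For example, prefixing a query with EXPLAIN shows the map-reduce stages Hive will generate; a sketch with a hypothetical posts table:

-- 'posts' and 'posttypeid' are placeholder names for illustration
EXPLAIN SELECT posttypeid, COUNT(*) FROM posts GROUP BY posttypeid;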

Hive

• Hive acts on metadata in the Hive metastore.

• Metadata is stored in an Apache Derby database by default, but a MySQL database can be used instead.

- When using the default Derby database, only one process can connect to the metastore at a time, so the default is suitable only for testing.

• Although queries act on tables defined in the metastore, the data itself lives in HDFS and the queries are converted into map-reduce jobs, so we do not get the efficiency and optimizations of an RDBMS.
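A sketch of pointing the metastore at MySQL instead of Derby via hive-site.xml (the host, database name, and credentials below are placeholders):

<!-- placeholder values; adjust for your MySQL server -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>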

Hive

• Peter Jamack at the IBM developerWorks blog asks:

Hive for ETL or ELT?

• You can extract, transform, then load your data with Hive, but Jamack suggests it is better to extract, load, then transform with Hive.

• Hive works better for some types of data than others.

“Obviously, choosing between adopting an ELT or ETL philosophy requires thought. This decision can account for more than 70 percent of the planning time required for many data warehouse, master data management, and other database projects.”

HCatalog

• “Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid.”

HCatalog

• HCatalog presents a relational view of data. Data is stored in tables, and these tables can be placed into databases.

• Hive can read data in HCatalog directly, because HCatalog is based on Hive's metastore. Other tools require interfaces, such as HCatLoader and HCatStorer for Pig.

• In other words, HCatalog can be seen as a project enabling non-Hive scripts to access Hive metastore tables.
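As a sketch of how a non-Hive script does this through the Pig interfaces mentioned above (the posts table name is a placeholder; in Hive 0.12 the loader class is packaged under org.apache.hive.hcatalog.pig):

-- Pig Latin: read a Hive/HCatalog table without specifying file paths or schemas
A = LOAD 'posts' USING org.apache.hive.hcatalog.pig.HCatLoader();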

HCatalog

• As mentioned earlier, we do not get the efficiency of an RDBMS despite the data being presented in a relational view.

Setup

Installing Hive requires some work; however, it comes with HCatalog out of the box, which shares Hive's metastore.

1. Download and unpack the tarball (.tar.gz).

2. Set the environment variable HIVE_HOME to point to the installation directory:

export HIVE_HOME="/usr/local/hive-0.12.0"

3. Add $HIVE_HOME/bin to your PATH:

export PATH=$HIVE_HOME/bin:$PATH

4. You will need Hadoop installed to continue. Create the following folders for Hive's metastore, then make them group-writable (chmod g+w) in HDFS:

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Setup

5. Copy the file hive-default.xml.template in $HIVE_HOME/conf and rename the copy hive-site.xml.

6. Start the HCatalog server by running $HIVE_HOME/hcatalog/sbin/hcat_server.sh start in the terminal.

7. Create a table in Hive using CREATE TABLE.
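A minimal sketch of such a table for the demo data (the column list and tab delimiter are assumptions about the StackOverflow extract):

-- columns are illustrative; match them to your actual file
CREATE TABLE posts (id INT, posttypeid INT, title STRING, body STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';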

8. Load data:

LOAD DATA LOCAL INPATH './files/stackoverflow.txt' OVERWRITE INTO TABLE posts;

Hive + HCatalog

• If you’ve made it this far, you can use your knowledge of SQL to run HiveQL queries.

SELECT column_name FROM table_name;

SELECT a.* FROM a JOIN b ON (a.id = b.id);

“Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.”

http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
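For example, where standard SQL would use an IN or EXISTS subquery, Hive uses LEFT SEMI JOIN; a sketch reusing the placeholder tables above:

-- returns the rows of a that have at least one match in b
SELECT a.* FROM a LEFT SEMI JOIN b ON (a.id = b.id);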

Setup #2

• Download the correct version of HortonWorks Sandbox for your virtual machine setup from the HortonWorks webpage.

• Double-click the HortonWorks Sandbox virtual machine launch file (it’s the only file in the folder you just downloaded).

• Point your browser to 127.0.0.1:8888

Demo

Loading data into HCat and analyzing it with Hive

Demo

• Fire up HortonWorks in VirtualBox

• Click on “HCatalog” on the upper toolbar

• On the left-hand side, choose “Create New Table Manually” and give it a name. Then:

• Finally, select a file to upload.

• Alternatively, use Hive for the whole process:
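A sketch of that Hive-only alternative, mirroring the earlier setup steps (the table name, columns, and file path are assumptions consistent with the query below):

-- define the table, then load the raw file in one step
CREATE TABLE post_etl (id INT, posttypeid INT, title STRING, body STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH './files/stackoverflow.txt' OVERWRITE INTO TABLE post_etl;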

• Once the data is loaded into HCat, we can take a closer look at the data.

SELECT * FROM post_etl WHERE posttypeid = 1;
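A follow-up sketch in the same vein (again assuming a posttypeid column), counting posts by type:

SELECT posttypeid, COUNT(*) FROM post_etl GROUP BY posttypeid;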

Q&A
