Clogeny's Hadoop Training Series - Apache Hive

Clogeny’s Hadoop Developer Training Series

An Introduction to Hive

Madhur [email protected]


Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482

What is Hive?

• A data warehousing infrastructure based on Hadoop
• Provides easy data summarization
• Provides ad-hoc querying and analysis of large volumes of data
• Comes with Hive QL, a query language based on SQL
• Allows custom mappers and reducers to be plugged in



What Hive is NOT

• Not suitable for small datasets, because of its high query latency
• Not comparable to systems like Oracle
• Does not offer real-time queries or row-level updates






Hive Architecture











Data Model Types - Tables

Tables
• Made up of the actual data and the associated metadata
• The actual data is stored in a Hadoop filesystem
• The metadata is always stored in a relational database such as MySQL

• Managed Tables: Hive physically moves the data into its warehouse directory
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;

• External Tables: Hive refers to data at an existing location in HDFS
hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;











Data Model Types - Partitions

Partitions
• A way to divide tables into coarse-grained parts
• Data is partitioned based on the value of a partition column
• Supports multiple dimensions
• Defined at table creation time using the PARTITIONED BY clause
• At the filesystem level, partitions are simply nested subdirectories of the table directory











Data Model Types - Partitions

hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);

hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
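
Because partitions are nested subdirectories of the table directory, the directory that a partitioned load ends up in can be derived from the partition values. The following is a minimal Python sketch of that naming scheme, not Hive code. It assumes the warehouse directory /user/hive/warehouse used elsewhere in this deck, and the helper name partition_path is hypothetical.

```python
import posixpath

def partition_path(warehouse_dir, table, partition_spec):
    """Return the directory for one partition of a managed table.

    partition_spec is a list of (column, value) pairs in the order the
    columns were declared in PARTITIONED BY. Each pair becomes one
    nested "column=value" subdirectory.
    """
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return posixpath.join(warehouse_dir, table, *parts)

# Partition used in the LOAD DATA statement above.
spec = [("dt", "2001-01-01"), ("country", "GB")]
print(partition_path("/user/hive/warehouse", "logs", spec))
# prints /user/hive/warehouse/logs/dt=2001-01-01/country=GB
```

A query that filters on dt or country only needs to read the matching subdirectories, which is what makes partitioning useful for coarse-grained pruning.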











Data Model Types - Buckets

Buckets
• Subdivide a table or partition into a fixed number of buckets, based on the hash of a column
• Enable more efficient queries by creating smaller buckets of data rather than working with an entire partition
• Make sampling more efficient
hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
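
To make the idea of bucketing concrete, here is a small Python sketch of how a row could be assigned to one of the 4 buckets declared above. It rests on an assumption: that an INT column hashes to its own value and that the bucket is the non-negative hash modulo the number of buckets. The function name bucket_for is hypothetical and is not part of Hive.

```python
def bucket_for(id_value, num_buckets):
    """Bucket index for an INT bucketing column.

    Assumption: an INT hashes to its own value; masking with 0x7FFFFFFF
    keeps the hash non-negative before taking the modulo.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets

for user_id in (0, 2, 3, 4):
    print(user_id, "-> bucket", bucket_for(user_id, 4))
# prints:
# 0 -> bucket 0
# 2 -> bucket 2
# 3 -> bucket 3
# 4 -> bucket 0
```

Under this scheme, rows with ids 0 and 4 land in the same bucket, so sampling one bucket out of four reads roughly a quarter of the rows.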











Column Data Types


Primitives

TYPE       DESCRIPTION                                      EXAMPLE
TINYINT    8-bit signed integer                             1
SMALLINT   16-bit signed integer                            1
INT        32-bit signed integer                            1
BIGINT     64-bit signed integer                            1
FLOAT      32-bit single-precision floating-point number    1.0
DOUBLE     64-bit double-precision floating-point number    1.0
BOOLEAN    true/false value                                 TRUE
STRING     Character string                                 'a', "a"
TIMESTAMP  Timestamp with nanosecond precision              '2012-01-02 03:04:05.123456789'












Column Data Types


Complex Data Types

TYPE    DESCRIPTION                                                                  EXAMPLE
ARRAY   An ordered collection of fields. The fields must all be of the same type.    array(1, 2)
MAP     An unordered collection of key-value pairs. Keys must be primitives;         map('a', 1, 'b', 2)
        values may be any type. For a particular map, the keys must be the
        same type, and the values must be the same type.
STRUCT  A collection of named fields. The fields may be of different types.          struct('a', 1, 1.0)










Metastore


A central repository of Hive metadata

Comprises two parts:
• Metastore service
• Backing store for the data










Metastore deployment modes

1: Embedded Mode

This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database.

Both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process.

This mode requires the least configuration effort, but it supports only one active user at a time and is not certified for production use.






Metastore deployment modes

2: Local Mode

In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host.

The embedded metastore service communicates with the metastore database over JDBC.






Metastore deployment modes

3: Remote Mode

In this mode the Hive metastore service runs in its own JVM process; other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property).






Metastore Properties


Property Name                            Type                  Description
hive.metastore.warehouse.dir             URI                   The directory in HDFS where managed tables are stored
hive.metastore.local                     Boolean               Whether to use an embedded or local metastore (true) or connect to a remote one (false)
hive.metastore.uris                      Comma-separated URIs  List of remote metastore URIs
javax.jdo.option.ConnectionURL           URI                   The JDBC URL of the metastore database
javax.jdo.option.ConnectionDriverName    String                The JDBC driver class name
javax.jdo.option.ConnectionUserName      String                The JDBC username
javax.jdo.option.ConnectionPassword      String                The JDBC password










Hive Packages

The following packages are needed by Hive:
• hive – base package that provides the complete language and runtime (required)
• hive-metastore – provides scripts for running the metastore as a standalone service (optional)
• hive-server – provides scripts for running the original HiveServer as a standalone service (optional)
• hive-server2 – provides scripts for running the new HiveServer2 as a standalone service (optional)






Comparison with Traditional Databases

Schema on Read Versus Schema on Write
• In a traditional database, a table’s schema is enforced at data load time
• If the data being loaded doesn’t conform to the schema, then it is rejected
• Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query is issued
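
The schema-on-read behavior described above can be illustrated with a short Python sketch. It is a simplified model, not Hive's implementation: raw lines are stored untouched at load time, and the schema is applied only when rows are read. The assumption that a missing or unparseable field becomes NULL (None here) mirrors what is commonly seen with delimited text tables; read_row and the sample lines are hypothetical.

```python
def read_row(line, schema, delimiter="\t"):
    """Apply a schema to one raw text line at query time.

    schema is a list of (column_name, converter) pairs. A field that is
    missing or cannot be converted becomes None (NULL) instead of the
    whole line being rejected.
    """
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, convert) in enumerate(schema):
        try:
            row[name] = convert(fields[i])
        except (IndexError, ValueError):
            row[name] = None
    return row

schema = [("id", int), ("name", str)]

# "Loading" keeps the lines exactly as they are; nothing is validated yet.
raw_lines = ["1\tTie\n", "oops\tCoat\n", "3\n"]

# The schema is only applied when the data is queried.
rows = [read_row(line, schema) for line in raw_lines]
for row in rows:
    print(row)
```

The trade-off is a very fast load (a file copy or move) in exchange for discovering malformed data only at query time.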

Updates, Transactions, and Indexes
• Updates, transactions, and indexes are mainstays of traditional databases.
• Until recently, these features have not been considered a part of Hive’s feature set






Installing Hive

We will install Hive with the metastore running as a standalone service. To do so, install the hive and hive-metastore packages:

$ yum -y install hive hive-metastore






Hive Configuration

Default configuration is in:
• /etc/hive/conf/hive-default.xml

(Re)define properties in:
• /etc/hive/conf/hive-site.xml

Use $HIVE_CONF_DIR to specify an alternate configuration directory.

You can override Hadoop configuration properties in Hive’s configuration:
• e.g. mapred.reduce.tasks=1






Configure Metastore database

Step 1: Install and start MySQL if you have not already done so
$ yum install mysql-server

Step 2: Configure the MySQL Service and Connector
$ yum install mysql-connector-java
$ ln -s /usr/share/java/mysql-connector-java-5.1.17.jar /usr/lib/hive/lib/mysql-connector-java-5.1.17.jar

Step 3: To set the MySQL root password:












Configure Metastore database cont…

Step 4: To make sure the MySQL server starts at boot
$ /sbin/chkconfig mysqld on

Step 5: Create the Database and User
• Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
• Create a user for hive with the hostname of the metastore.
• Grant proper privileges to the user.






Configure Metastore database cont…







Step 6: Configure the Metastore Service to Communicate with the MySQL Database
• This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the MySQL database, and provides sample settings.
• You can use the same hive-site.xml on all hosts (client, metastore, HiveServer), but hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.
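
As an illustration of what such settings can look like, the Python sketch below generates a hive-site.xml fragment from the property names listed in the Metastore Properties table. Every value is a placeholder and an assumption, not a setting taken from the original slides: the host names, database name, user, and password are hypothetical, com.mysql.jdbc.Driver is the class name used by MySQL Connector/J 5.1, and 9083 is the port commonly used by the metastore Thrift service. The helper build_hive_site is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

def build_hive_site(properties):
    """Build a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Placeholder values only; replace them with the details of your own hosts.
metastore_props = {
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore-db.example.com/metastore",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
    "hive.metastore.uris": "thrift://metastore.example.com:9083",
}

xml_text = build_hive_site(metastore_props)
print(xml_text)
```

On client and HiveServer hosts only the hive.metastore.uris entry would be needed; the javax.jdo.option.* entries matter only on the metastore host.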


















Configure Metastore database cont…

Step 7: Create the Hive user directory in HDFS
$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chmod og+rw /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chown -R hive /user/hive

Step 8: Set Environment Variables
• Add the following to the .bashrc file:
$ vim ~/.bashrc
export HADOOP_HOME="/usr/lib/hadoop"
PATH=$PATH:"/usr/lib/hadoop/bin"
• Run bash at the command prompt so the new settings take effect:
$ bash






Starting the Metastore

You can run the metastore from the command line:
$ hive --service metastore

Ensure that the above command does not report any errors. Use Ctrl-C to stop the metastore process running from the command line.

To run the metastore as a daemon, the command is:
$ service hive-metastore start






Starting the Hive Console

To start the Hive console:
$ hive

To confirm that Hive is working, issue the show tables; command to list the Hive tables. Be sure to end the command with a semicolon:
hive> SHOW tables;






Hive CLI Commands

Set a Hive or Hadoop conf property:
hive> set propkey=value;

List all properties and values:
hive> set -v;






Hive CLI Commands

Creating managed table
$ cat input/hive/tables/data.txt
$ hive
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO table managed_table;
hive> select * from managed_table;
$ hadoop fs -cat /user/hive/warehouse/managed_table/data.txt












Hive CLI Commands

Creating external table
• Select a location in HDFS for the table
• Ensure other users have write access to it
$ sudo -u hdfs hadoop fs -mkdir /user/joe/table
$ sudo -u hdfs hadoop fs -chmod a+w /user/joe/table

• Create the external table and load data into it:
hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/joe/table';
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE external_table;
hive> select * from external_table;

• Check that the data was placed in the external directory:
$ sudo -u hdfs hadoop fs -cat /user/joe/table/data.txt












Hive CLI Commands

Create Partitioned table
hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);

Load data into the table, specifying the partitions
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file2' INTO TABLE logs PARTITION (dt='2001-01-01', country='US');
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file3' INTO TABLE logs PARTITION (dt='2001-01-02', country='US');

See the table contents
hive> select * from logs;

List all the partitions
hive> SHOW PARTITIONS logs;












Hive CLI Commands

Create Bucket:
• Create a normal table named users, then create a bucketed table named bucketed_users from it
hive> set hive.enforce.bucketing=true;
hive> CREATE TABLE users (id INT, name STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/users.txt' INTO table users;
hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;

• Check the contents of the whole table, then of a single bucket
hive> select * from bucketed_users;
hive> select * from bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);












Joins

Prerequisites
• Create two tables, sales and things, and load data into them from files

hive> CREATE TABLE sales (user STRING, id INT) row format delimited fields terminated by '\t' stored as textfile;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/sales.txt' INTO table sales;
hive> select * from sales;

hive> CREATE TABLE things (id INT, name STRING) row format delimited fields terminated by '\t' stored as textfile;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/things.txt' INTO table things;
hive> select * from things;












Joins

Inner Join
hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);






Joins

Left Outer Join
hive> SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);






Joins

Right Outer Join
hive> SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);






Joins

Full Outer Join
hive> SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
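
The difference between the four join types above is easiest to see on a few rows. The Python sketch below models the semantics only (Hive runs joins as MapReduce jobs, and its output row order may differ). The sample rows are made up for illustration and are not the contents of sales.txt or things.txt; the function name join is hypothetical.

```python
def join(left, right, how="inner"):
    """Join (user, id) rows with (id, name) rows on id.

    Returns (user, sales_id, things_id, name) tuples; None stands for NULL.
    how is one of "inner", "left", "right", "full".
    """
    out = []
    matched_right = set()
    for user, sid in left:
        hits = [(i, r) for i, r in enumerate(right) if r[0] == sid]
        for i, (tid, name) in hits:
            out.append((user, sid, tid, name))
            matched_right.add(i)
        # Outer joins keep unmatched left rows, padded with NULLs.
        if not hits and how in ("left", "full"):
            out.append((user, sid, None, None))
    # Outer joins on the right side keep unmatched right rows.
    if how in ("right", "full"):
        for i, (tid, name) in enumerate(right):
            if i not in matched_right:
                out.append((None, None, tid, name))
    return out

# Illustrative rows only.
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0)]
things = [(2, "Tie"), (4, "Coat"), (1, "Scarf")]

for how in ("inner", "left", "right", "full"):
    print(how, join(sales, things, how))
```

With these rows the inner join returns 2 rows, the left and right outer joins return 3 rows each, and the full outer join returns 4.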






Joins

Semi Joins
• The Hive release used here (0.10) does not support IN subqueries, so a query like this one fails:
hive> SELECT * from things WHERE things.id IN (SELECT id from sales);

• The solution is a semi join:
hive> SELECT * from things LEFT SEMI JOIN sales ON (sales.id = things.id);
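
A semi join returns each row of the left table at most once, no matter how many matching rows exist on the right, which is why it can stand in for an IN subquery. A minimal Python sketch of that behavior, using made-up rows rather than the contents of the training files (left_semi_join is a hypothetical name):

```python
def left_semi_join(things, sales):
    """Rows of things whose id appears in sales, each at most once."""
    sales_ids = {sid for _user, sid in sales}
    return [row for row in things if row[0] in sales_ids]

# Illustrative rows only; id 2 appears twice in sales.
sales_rows = [("Joe", 2), ("Hank", 2), ("Eve", 3)]
things_rows = [(2, "Tie"), (3, "Hat"), (1, "Scarf")]

print(left_semi_join(things_rows, sales_rows))
# prints [(2, 'Tie'), (3, 'Hat')]
```

An inner join on the same rows would return the Tie row twice, once for each matching sale.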






Joins

Map Joins
• Used when one table is small enough to fit in memory. No reducers are used.
hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);






Other Commands

CREATE TABLE…AS SELECT
hive> CREATE TABLE target AS SELECT id from things;

Altering Tables
hive> ALTER TABLE target RENAME TO source;
hive> ALTER TABLE source ADD COLUMNS (col2 STRING);






Other Commands

Dropping Tables
• For managed tables, both the data and the metadata are deleted
• For external tables, only the metadata is deleted

hive> drop table <table_name>;






References

• Hadoop: The Definitive Guide, 3rd Edition
• Hive Community page: http://hive.apache.org/






