Clogeny's Hadoop Training Series - Apache Hive

DESCRIPTION

This Hive hands-on training is part of Clogeny's Hadoop Training Series. This will give you a complete overview of Apache Hive including architecture, data models, installation, configuration and important Hive commands/scripts.

Clogeny’s Hadoop Developer Training Series

An Introduction to Hive

Madhur Nawandar [email protected]

Cloud Computing
Private & Public Clouds
Big Data
Storage
DevOps

Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482

What is Hive?

• A data warehousing infrastructure based on Hadoop
• Provides easy data summarization
• Provides ad-hoc querying and analysis of large volumes of data
• Comes with HiveQL, a query language based on SQL
• Allows custom mappers and reducers to be plugged in

What Hive is NOT

• Not suitable for small datasets, because of its high latency
• Cannot be compared to systems like Oracle
• Does not offer real-time queries or row-level updates

Hive Architecture

Data Model Types - Tables

• Made up of the actual data and the associated metadata
• Actual data is stored in a Hadoop filesystem
• Metadata is always stored in a relational database such as MySQL
• Managed tables: Hive physically moves the data into its warehouse directory

hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;

• External tables: Hive refers to data at an existing location in HDFS

hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

Data Model Types - Partitions

• A way to divide tables into coarse-grained parts
• Data is partitioned based on the value of a partition column
• Supports multiple dimensions
• Defined at table creation time using the PARTITIONED BY clause
• At the filesystem level, partitions are simply nested subdirectories of the table directory

Data Model Types - Partitions (example)

hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
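The nested-subdirectory layout can be pictured with a small Python sketch. This is not Hive code: the helper name partition_path is hypothetical, the warehouse root /user/hive/warehouse is the directory used in the setup steps of this training, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
def partition_path(warehouse_dir, table, partition_spec):
    """Return the directory that holds one partition's files.

    partition_spec is an ordered list of (column, value) pairs, in the
    same order as the PARTITIONED BY clause. Illustrative only.
    """
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return "/".join([warehouse_dir.rstrip("/"), table] + parts)


# Partition used in the LOAD DATA example for the logs table
spec = [("dt", "2001-01-01"), ("country", "GB")]
print(partition_path("/user/hive/warehouse", "logs", spec))
```

Each partition column adds one directory level, which is why a query that filters on dt or country only needs to read the matching subdirectories.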

Data Model Types - Buckets

• Subdivide a table or partition into a fixed number of buckets, based on a hash of the bucketing column
• Enable more efficient queries by working with smaller buckets of data rather than with an entire partition
• Make sampling more efficient

hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;

Column Data Types - Primitives

TYPE | DESCRIPTION | EXAMPLE
TINYINT | 8-bit signed integer | 1
SMALLINT | 16-bit signed integer | 1
INT | 32-bit signed integer | 1
BIGINT | 64-bit signed integer | 1
FLOAT | 32-bit single-precision floating-point number | 1.0
DOUBLE | 64-bit double-precision floating-point number | 1.0
BOOLEAN | true/false value | TRUE
STRING | Character string | 'a', "a"
TIMESTAMP | Timestamp with nanosecond precision | '2012-01-02 03:04:05.123456789'

Column Data Types - Complex Data Types

TYPE | DESCRIPTION | EXAMPLE
ARRAY | An ordered collection of fields. The fields must all be of the same type | array(1, 2)
MAP | An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must all be the same type, and the values must all be the same type | map('a', 1, 'b', 2)
STRUCT | A collection of named fields. The fields may be of different types | struct('a', 1, 1.0)

Metastore

A central repository of Hive metadata. It comprises two parts:
• Metastore service
• Backing store for the data

Metastore deployment modes

1: Embedded Mode

This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database.

Both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process.

This mode requires the least amount of effort to configure, but it can support only one active user at a time and is not certified for production use.

2: Local Mode

In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host.

The embedded metastore service communicates with the metastore database over JDBC.

3: Remote Mode

In this mode the Hive metastore service runs in its own JVM process; other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property).

Metastore Properties

Property Name | Type | Description
hive.metastore.warehouse.dir | URI | The directory in HDFS where managed tables are stored
hive.metastore.local | Boolean | Whether to use an embedded or local metastore (true) or connect to a remote metastore (false)
hive.metastore.uris | Comma-separated URIs | List of remote metastore URIs
javax.jdo.option.ConnectionURL | URI | The JDBC URL of the metastore database
javax.jdo.option.ConnectionDriverName | String | The JDBC driver class name
javax.jdo.option.ConnectionUserName | String | The JDBC username
javax.jdo.option.ConnectionPassword | String | The JDBC password

Hive Packages

The following packages are needed by Hive:
• hive – base package that provides the complete language and runtime (required)
• hive-metastore – provides scripts for running the metastore as a standalone service (optional)
• hive-server – provides scripts for running the original HiveServer as a standalone service (optional)
• hive-server2 – provides scripts for running the new HiveServer2 as a standalone service (optional)

Comparison with Traditional Databases

Schema on Read Versus Schema on Write
• In a traditional database, a table’s schema is enforced at data load time
• If the data being loaded doesn’t conform to the schema, it is rejected
• Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query is issued
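A minimal Python sketch of the schema-on-read idea, not Hive's actual implementation: the raw lines are stored untouched, and the schema is applied only when rows are read. The helper names (parse_int, read_rows) and the sample lines are hypothetical; the sketch assumes that a field which is missing or cannot be converted is returned as NULL (None here) rather than causing the load to be rejected.

```python
def parse_int(text):
    """Convert a field to int, or return None (NULL) if it does not parse."""
    try:
        return int(text)
    except ValueError:
        return None


def read_rows(lines, schema, delimiter="\t"):
    """Apply a schema to raw text lines at read time.

    schema is a list of (column_name, converter) pairs.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = []
        for i, (_name, convert) in enumerate(schema):
            # A missing trailing field becomes NULL instead of an error
            row.append(convert(fields[i]) if i < len(fields) else None)
        rows.append(tuple(row))
    return rows


# "Loading" is just storing the lines; nothing is validated here
stored = ["Joe\t2", "Hank\tfour", "Ali"]

# The schema is only applied when the data is queried
schema = [("user", str), ("id", parse_int)]
for row in read_rows(stored, schema):
    print(row)
```

The trade-off is that loading is very fast, since it is only a file copy or move, while malformed data surfaces later, at query time.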

Updates, Transactions, and Indexes
• Updates, transactions, and indexes are mainstays of traditional databases
• Until recently, these features have not been considered a part of Hive’s feature set

Installing Hive

We will install Hive with the metastore as a standalone service. For this, install the hive and hive-metastore packages:

$ yum -y install hive hive-metastore

Hive Configuration

Default configuration in
• /etc/hive/conf/hive-default.xml

(Re)define properties in
• /etc/hive/conf/hive-site.xml

Use $HIVE_CONF_DIR to specify an alternate configuration directory.

You can override Hadoop configuration properties in Hive’s configuration
• e.g. mapred.reduce.tasks=1

Configure Metastore database

Step 1: Install and start MySQL if you have not already done so
$ yum install mysql-server

Step 2: Configure the MySQL Service and Connector
$ yum install mysql-connector-java
$ ln -s /usr/share/java/mysql-connector-java-5.1.17.jar /usr/lib/hive/lib/mysql-connector-java-5.1.17.jar

Step 3: To set the MySQL root password:

Step 4: To make sure the MySQL server starts at boot
$ /sbin/chkconfig mysqld on

Step 5: Create the Database and User
• Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
• Create a user for hive with the hostname of the metastore.
• Grant proper privileges to the user.

Step 6: Configure the Metastore Service to Communicate with the MySQL Database
• This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the MySQL database, and provides sample settings.
• Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

Step 7: Create the Hive user directory in HDFS
$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chmod og+rw /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chown -R hive /user/hive

Step 8: Set Environment Variables
• Add the following to the .bashrc file
$ vim ~/.bashrc
export HADOOP_HOME="/usr/lib/hadoop"
PATH=$PATH:"/usr/lib/hadoop/bin"
• Run the command "bash" at the command prompt
$ bash

Starting the Metastore

You can run the metastore from the command line:
$ hive --service metastore

Ensure that the above does not give any error. Use Ctrl-C to stop the metastore process running from the command line.

To run the metastore as a daemon, the command is:
$ service hive-metastore start

Starting the Hive Console

To start the Hive console:
$ hive

To confirm that Hive is working, issue the SHOW TABLES; command to list the Hive tables; be sure to end the command with a semicolon:
hive> SHOW tables;

Hive CLI Commands

Set a Hive or Hadoop configuration property:
hive> set propkey=value;

List all properties and values:
hive> set -v;

Creating a managed table

$ cat input/hive/tables/data.txt
$ hive
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO table managed_table;
hive> select * from managed_table;
$ hadoop fs -cat /user/hive/warehouse/managed_table/data.txt

Creating an external table

• Select a location in HDFS for the table
• Ensure that other users have write access to it

$ sudo -u hdfs hadoop fs -mkdir /user/joe/table
$ sudo -u hdfs hadoop fs -chmod a+w /user/joe/table

• Create the external table and load data into it:

hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/joe/table';
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE external_table;
hive> select * from external_table;

• Check that the data file was placed in the external directory

$ sudo -u hdfs hadoop fs -cat /user/joe/table/data.txt

Creating a partitioned table

hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);

Load data into the table, specifying the partition:
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file2' INTO TABLE logs PARTITION (dt='2001-01-01', country='US');
hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file3' INTO TABLE logs PARTITION (dt='2001-01-02', country='US');

See the table contents:
hive> select * from logs;

List all the partitions:
hive> SHOW PARTITIONS logs;

Creating a bucketed table

• Create a normal table named users, then create a bucketed table named bucketed_users from it

hive> set hive.enforce.bucketing=true;
hive> CREATE TABLE users (id INT, name STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/users.txt' INTO table users;
hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;

• Check the contents of the whole table, then of a single bucket

hive> select * from bucketed_users;
hive> select * from bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
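How rows end up in buckets can be sketched in Python. This is an illustration, not Hive's code. It assumes that the hash of an INT column is the integer value itself, and that the bucket is that non-negative hash modulo the number of buckets. The function name bucket_for and the sample rows are hypothetical, since the contents of users.txt are not shown here.

```python
def bucket_for(id_value, num_buckets):
    """Return the zero-based bucket index for an INT bucketing column.

    Assumption: the hash of an INT is the value itself. The mask keeps
    the result non-negative, as a 32-bit signed hash would need.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets


# Hypothetical rows standing in for the users table
users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

buckets = {}
for user_id, name in users:
    buckets.setdefault(bucket_for(user_id, 4), []).append((user_id, name))

for index in sorted(buckets):
    print(index, buckets[index])
```

TABLESAMPLE numbers buckets from 1, so under these assumptions BUCKET 1 OUT OF 4 corresponds to index 0 in this sketch, that is, the rows whose id is a multiple of 4.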

Joins

Prerequisites
• Create two tables, sales and things, and load data into them from files

hive> CREATE TABLE sales (user STRING, id INT) row format delimited fields terminated by '\t' stored as textfile;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/sales.txt' INTO table sales;
hive> select * from sales;

hive> CREATE TABLE things (id INT, name STRING) row format delimited fields terminated by '\t' stored as textfile;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/things.txt' INTO table things;
hive> select * from things;

Inner Join
hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);

Left Outer Join
hive> SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);

Right Outer Join
hive> SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);

Full Outer Join
hive> SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);

Semi Joins
• Hive (as of the 0.10 release used in this training) does not support IN subqueries, so a query like this one fails:

hive> SELECT * from things WHERE things.id IN (SELECT id from sales);

• The solution is a semi join:

hive> SELECT * from things LEFT SEMI JOIN sales ON (sales.id = things.id);
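The join variants differ only in which unmatched rows they keep. The Python sketch below shows the semantics of the inner, left outer, and left semi joins on small in-memory lists. It is not how Hive executes joins, and the rows are hypothetical sample data, since the contents of sales.txt and things.txt are not shown here.

```python
# Hypothetical sample rows: sales(user, id) and things(id, name)
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def inner_join(left, right):
    """Keep only the pairs whose ids match."""
    return [(user, sid, tid, name)
            for (user, sid) in left
            for (tid, name) in right
            if sid == tid]


def left_outer_join(left, right):
    """Keep every left row; pad with None (NULL) when nothing matches."""
    result = []
    for (user, sid) in left:
        matches = [(tid, name) for (tid, name) in right if tid == sid]
        if matches:
            result.extend((user, sid, tid, name) for (tid, name) in matches)
        else:
            result.append((user, sid, None, None))
    return result


def left_semi_join(left, right_ids):
    """Keep each left row at most once if its id appears on the right."""
    wanted = set(right_ids)
    return [row for row in left if row[0] in wanted]


print(inner_join(sales, things))
print(left_outer_join(sales, things))
# Equivalent of: things LEFT SEMI JOIN sales ON (sales.id = things.id)
print(left_semi_join(things, [sid for (_user, sid) in sales]))
```

Note that the semi join returns only columns from the left table and never duplicates a left row, even though id 2 appears twice in the sample sales rows.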

Map Joins
• Used when one table is small enough to fit in memory. No reducers are used.

hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
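The idea behind a map join can be sketched in Python as a hash join: the small table is loaded into an in-memory lookup table, and the large table is streamed past it one row at a time, so no shuffle or reduce phase is needed. This is a conceptual sketch only; the function name map_join and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_join(big_rows, small_rows):
    """Join streamed (user, id) rows against an in-memory (id, name) table."""
    # Build phase: hash the small table once, keyed by the join column
    lookup = defaultdict(list)
    for thing_id, name in small_rows:
        lookup[thing_id].append(name)

    # Probe phase: each streamed row is matched independently,
    # which is what lets every mapper do the join on its own
    for user, sale_id in big_rows:
        for name in lookup.get(sale_id, []):
            yield (user, sale_id, sale_id, name)


# Hypothetical sample rows
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

for row in map_join(sales, things):
    print(row)
```

The approach only pays off when the hashed table really fits in memory, which is why the hint names the small table (things) rather than the large one.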

Other Commands

CREATE TABLE…AS SELECT
hive> CREATE TABLE target AS SELECT id from things;

Altering Tables
hive> ALTER TABLE target RENAME TO source;
hive> ALTER TABLE source ADD COLUMNS (col2 STRING);

Dropping Tables
• For managed tables, both data and metadata are deleted
• For external tables, only metadata is deleted

hive> drop table <table_name>;

References

• Hadoop: The Definitive Guide, 3rd Edition
• Hive Community page: http://hive.apache.org/