Introduction On Hive – Ajit Koti

Apache Hive


DESCRIPTION

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.


Page 1: Apache Hive

Introduction On Hive

Ajit Koti

Page 2: Apache Hive

– Analysis of data is done by both engineering and non-engineering people.
– The data is growing fast: the volume was 15 TB in 2007 and grew to 200 TB by 2010.
– Current RDBMSs cannot handle it.
– Existing solutions are not scalable, expensive, and proprietary.

2

Motivation

Page 3: Apache Hive

MapReduce is a programming model and an associated implementation introduced by Google in 2004. Apache Hadoop is a software framework inspired by Google's MapReduce.

3

Map/Reduce -Apache Hadoop

Page 4: Apache Hive

Hadoop supports data-intensive distributed applications.

However...
– MapReduce is hard to program (users know SQL/bash/Python).
– No schema.

4

Motivation (cont.)

Page 5: Apache Hive

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.

Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance

5

What is HIVE?

Page 6: Apache Hive

– Log processing
– Text mining
– Document indexing
– Customer-facing business intelligence (e.g., Google Analytics)
– Predictive modeling, hypothesis testing

6

Hive Applications

Page 7: Apache Hive

7

Hive Architecture

Page 8: Apache Hive

– Databases
– Tables
– Partitions
– Buckets (or Clusters)

8

Data Units
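A hedged sketch of how these units combine in DDL (the table and column names here are illustrative, not from the deck):

CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

Each database holds tables, each distinct ds value becomes a partition directory, and rows within a partition are hashed on user_id into 32 bucket files.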

Page 9: Apache Hive

Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE.
– String: STRING.

Complex types
– Structs: {a INT; b INT}.
– Maps: M['group'].
– Arrays: ['a', 'b', 'c'], A[1] returns 'b'.

9

Type System
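A hedged sketch combining these types in a table definition (names are illustrative, not from the deck):

CREATE TABLE complex_demo (
  id  INT,
  s   STRUCT<a:INT, b:INT>,
  m   MAP<STRING, INT>,
  arr ARRAY<STRING>
);

SELECT s.a, m['group'], arr[1] FROM complex_demo;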

Page 10: Apache Hive

Warehouse directory in HDFS – e.g., /user/hive/warehouse
Table row data stored in subdirectories of the warehouse

Partitions form subdirectories of table directories

Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With a custom SerDe, can use arbitrary format

10

Physical Layout
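As a concrete sketch (assuming the default warehouse directory above and the partitioned sample table used in the later examples), a loaded file would end up under a path like:

/user/hive/warehouse/sample/ds=2012-02-24/sample.txt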

Page 11: Apache Hive

11

HiveQL

Page 12: Apache Hive

CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);

SHOW TABLES '.*s';

DESCRIBE sample;

ALTER TABLE sample ADD COLUMNS (new_col INT);

DROP TABLE sample;

12

Examples – DDL Operations

Page 13: Apache Hive

LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

13

Examples – DML Operations


Page 15: Apache Hive

SELECT foo FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

15

SELECTS and FILTERS

Page 16: Apache Hive

SELECT MAX(foo) FROM sample;

SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;

FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

16

Aggregations and Groups

Page 17: Apache Hive

SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id);

SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce ON (c.id=ce.cus_id);

17

Join

CREATE TABLE customer (id INT,name STRING,address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';

CREATE TABLE order_cust (id INT,cus_id INT,prod_id INT,price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Page 18: Apache Hive

18

Extension mechanisms

Page 19: Apache Hive

Java code

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}

19

User-defined function

Registering the class

CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';

Using the function

SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title);

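Note (an assumption about typical usage, not shown in the deck): the compiled jar containing the Lower class normally has to be added to the session before the function is registered, e.g. with a hypothetical jar name:

ADD JAR my_lower.jar;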

Page 20: Apache Hive

Mathematical: round, floor, ceil, rand, exp...

Collection: size, map_keys, map_values, array_contains.

Type Conversion: cast.

Date: from_unixtime, to_date, year, datediff...

Conditional: if, case, coalesce.

String: length, reverse, upper, trim...

20

Built-in Functions
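A hedged example combining a few of these against the sample table defined earlier in the deck:

SELECT upper(bar), round(foo), if(foo > 0, 'positive', 'non-positive'), cast(foo AS DOUBLE) FROM sample WHERE ds='2012-02-24';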

Page 21: Apache Hive


21

Map/Reduce Scripts

my_append.py:

import sys

i = 0
for line in sys.stdin:
    line = line.strip()
    key = line.split('\t')[0]
    value = line.split('\t')[1]
    print(key + str(i) + '\t' + value + str(i))
    i = i + 1

Using the function:

SELECT TRANSFORM (foo, bar) USING 'python ./my_append.py' FROM sample;
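Note (a typical-usage assumption, not from the deck): when the query runs on a cluster rather than locally, the script is usually shipped to the worker nodes first, before issuing the TRANSFORM query above:

ADD FILE ./my_append.py;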

Page 22: Apache Hive

22

Install and Config Hive

Page 23: Apache Hive

From a Release Tarball:
$ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
$ tar xvzf hive-0.5.0-bin.tar.gz
$ cd hive-0.5.0-bin
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH

23

Installing Hive

Page 24: Apache Hive

Java 1.6
Hadoop >= 0.17-0.20
Hive *MUST* be able to find Hadoop:
– $HADOOP_HOME=<hadoop-install-dir>
– Add $HADOOP_HOME/bin to $PATH
Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

24

Hive Dependencies

Page 25: Apache Hive

Default configuration in $HIVE_HOME/conf/hive-default.xml – DO NOT TOUCH THIS FILE!
(Re)Define properties in $HIVE_HOME/conf/hive-site.xml
Use $HIVE_CONF_DIR to specify an alternate conf dir location
You can override Hadoop configuration properties in Hive's configuration, e.g.:
– mapred.reduce.tasks=1

25

Hive Configuration
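For example (a sketch restating the property above; the CLI set command shown on the following slides applies the same override per-session):

hive> set mapred.reduce.tasks=1;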

Page 26: Apache Hive

26

Hive CLI

Page 27: Apache Hive

Start a terminal and run: $ hive
Should see a prompt like: hive>
Set a Hive or Hadoop conf prop:
– hive> set propkey=value;
List all properties and values:
– hive> set -v;
Add a resource to the DCache:
– hive> add [ARCHIVE|FILE|JAR] filename;

27

Hive CLI Commands

Page 28: Apache Hive

List tables:
– hive> show tables;
Describe a table:
– hive> describe <tablename>;
More information:
– hive> describe extended <tablename>;
List functions:
– hive> show functions;
More information:
– hive> describe function <functionname>;

28

Hive CLI Commands

Page 29: Apache Hive

An easy way to process large-scale data.
Supports SQL-based queries.
Provides user-defined interfaces to extend programmability.
Files in HDFS are immutable.
Typical uses:
– Log processing: daily reports, user activity measurement
– Data/text mining: machine learning (training data)
– Business intelligence: advertising delivery, spam detection

29

Conclusion

Page 30: Apache Hive

30

Thank You

Follow @ajitkoti