59
Copyright © 20112012, Think Big Analy8cs, All Rights Reserved Hive: SQL for Hadoop An Essen8al Tool for Hadoopbased Data Warehouses Tuesday, January 10, 12 I’ll argue that Hive is indispensable to people crea8ng “data warehouses” with Hadoop, because it gives them a “similar” SQL interface to their data, making it easier to migrate skills and even apps from exis8ng rela8onal tools to Hadoop.

Hive sq lfor-hadoop

Embed Size (px)

DESCRIPTION

Hive hadoop learning hive for bigdata,Hive hadoop learning hive for bigdata

Citation preview

Page 1: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

Hive: SQL for Hadoop

An  Essen8al  Tool  for  Hadoop-­‐based  

Data  Warehouses

Tuesday, January 10, 12

I’ll  argue  that  Hive  is  indispensable  to  people  crea8ng  “data  warehouses”  with  Hadoop,  because  it  gives  them  a  “similar”  SQL  interface  to  their  data,  making  it  easier  to  migrate  skills  and  even  apps  from  exis8ng  rela8onal  tools  to  Hadoop.

Page 2: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

Adapted from ourHadoop 3-Day Developer Course

[email protected]

Tuesday, January 10, 12

These  notes  are  (heavily)  adapted  from  our  Hadoop  3-­‐day  developer  course.  Of  course,  we  won’t  do  exercises  in  this  talk  ;)  The  second  day  focuses  on  Hive.  We  also  offer  Hive  training  by  itself,  admin  courses,  etc.  

Page 3: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

Programming Hive Summer, 2012

O’Reilly

Tuesday, January 10, 12

I’m  collabora8ng  with  two  other  colleagues  to  write  “Programming  Hive”,  which  will  be  published  by  O’Reilly  this  summer.  Un8l  it’s  ready,  the  best  source  of  documenta8on  is  on  the  Hive  site,  hYp://hive.apache.org.

Page 4: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved4

programmingscala.compolyglotprogramming.com/fpjava

Dean Wampler

Functional Programming

for Java Developers

Dean Wampler

20+ years exp.

Think Big Academy

Tuesday, January 10, 12

Page 5: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved5

A True Story...

Tuesday, January 10, 12

Page 6: Hive sq lfor-hadoop

Once upon an Internet time, in the land of Enterprise, the King’s Warehouse of Data was bursting at the seams.

Tuesday, January 10, 12

Page 7: Hive sq lfor-hadoop

And lo, the scribes decided they could cut costs and store more data by moving to Hadoop.

Tuesday, January 10, 12

Page 8: Hive sq lfor-hadoop

But then the scribes complained amongst themselves, “The language of the land of Java hurts our tongues!

Tuesday, January 10, 12

Page 9: Hive sq lfor-hadoop

“It takes forever to say anything!”

Tuesday, January 10, 12

Page 10: Hive sq lfor-hadoop

There was a scribe who walked through the Kingdom-wide Labyrinth (kwl), where he found The Book of Faces.

Tuesday, January 10, 12

Page 11: Hive sq lfor-hadoop

In that book, he found the secret language the Bee Keepers use to query the god Hadoop!

Tuesday, January 10, 12

Page 12: Hive sq lfor-hadoop

And there was much rejoicing, for the scribes could work with their data again!

Tuesday, January 10, 12

Page 13: Hive sq lfor-hadoop

The End.

Tuesday, January 10, 12

Page 14: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved14

Enter Hive

Tuesday, January 10, 12

Page 15: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved15

Since your team knows SQL and all your Data Warehouse

apps are written in SQL, Hive minimizes the effort of

migrating to Hadoop.

Tuesday, January 10, 12

Page 16: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Ideal for data warehousing.

• Ad-hoc queries of data.

• Familiar SQL dialect.

• Analysis of large data sets.

• Hadoop MapReduce jobs.16

Hive

Tuesday, January 10, 12

Hive is a killer app, in our opinion, for data warehouse teams migrating to Hadoop, because it gives them a familiar SQL language that hides the complexity of MR programming.

Page 17: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Invented at Facebook.

• Open sourced to Apache in 2008.• http://hive.apache.org

17

Hive

Tuesday, January 10, 12

Page 18: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved18

A Scenario:Mining Daily Click Stream

Logs

Tuesday, January 10, 12

Page 19: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved19

Ingest & Transform:• From: file://server1/var/log/clicks.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

Tuesday, January 10, 12

As we copy the daily click stream log over to a local staging location, we transform it into the Hive table format we want.

Page 20: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved20

Ingest & Transform:• From: file://server1/var/log/clicks.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

Timestamp

Tuesday, January 10, 12

Page 21: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved21

Ingest & Transform:• From: file://server1/var/log/clicks.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

The server

Tuesday, January 10, 12

Page 22: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved22

Ingest & Transform:• From: file://server1/var/log/clicks.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

The process (“movies search”) and the process id.

Tuesday, January 10, 12

Page 23: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved23

Ingest & Transform:• From: file://server1/var/log/clicks.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

Customer id

Tuesday, January 10, 12

Page 24: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved24

Ingest & Transform:• From: file://server1/var/log/clicks.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

The log “message”

Tuesday, January 10, 12

Page 25: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved25

Ingest & Transform:• From: file://server1/var/log/clicks.log

09:02:17^Aserver1^Amovies^A18^A1234^Asearch for “vampires in love”.…

• To: /staging/2012-01-09.log

Jan 9 09:02:17 server1 movies[18]: 1234: search for “vampires in love”.…

Tuesday, January 10, 12

As we copy the daily click stream log over to a local staging location, we transform it into the Hive table format we want.

Page 26: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved26

Ingest & Transform:

09:02:17^Aserver1^Amovies^A18^A1234^Asearch for “vampires in love”.…

• To: /staging/2012-01-09.log

• Removed month (Jan) and day (09).

• Added ^A as field separators (Hive convention).

• Separated process id from process name.

Tuesday, January 10, 12

The transformations we made. (You could use many different Linux, scripting, code, or Hadoop-related ingestion tools to do this.

Page 27: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved27

Ingest & Transform:

hadoop fs -put /staging/2012-01-09.log \ /clicks/2012/01/09/log.txt

• Put in HDFS:

• (The final file name doesn’t matter…)

Tuesday, January 10, 12

Here we use the hadoop shell command to put the file where we want it in the file system. Note that the name of the target file doesn’t matter; we’ll just tell Hive to read all files in the directory, so there could be many files there!

Page 28: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved28

Back to Hive...

CREATE EXTERNAL TABLE clicks ( hms STRING, hostname STRING, process STRING, pid INT, uid INT, message STRING) PARTITIONED BY ( year INT, month INT, day INT);

• Create an external Hive table:

You don’t have to use EXTERNAL

and PARTITIONED together….

Tuesday, January 10, 12

Now let’s create an “external” table that will read those files as the “backing store”. Also, we make it partitioned to accelerate queries that limit by year, month or day. (You don’t have to use external and partitioned together…)

Page 29: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved29

Back to Hive...

ALTER TABLE clicks ADD IF NOT EXISTS PARTITION ( year=2012, month=01, day=09) LOCATION '/clicks/2012/01/09';

• Add a partition for 2012-01-09:

• A directory in HDFS.

Tuesday, January 10, 12

We add a partition for each day. Note the LOCATION path, which is a the directory where we wrote our file.

Page 30: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved30

Now, Analyze!!

SELECT hms, uid, message FROM clicks WHERE message LIKE '%vampire%' AND year = 2012 AND month = 01 AND day = 09;

• What’s with the kids and vampires??

• After some MapReduce crunching...… 09:02:29 1234 search for “twilight of the vampires”09:02:35 1234 add to cart “vampires want their genre back”…

Tuesday, January 10, 12

And we can run SQL queries!!

Page 31: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• SQL analysis with Hive.

• Other tools can use the data, too.

• Massive scalability with Hadoop.

31

Recap

Tuesday, January 10, 12

Page 32: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved32

Hadoop

Master✲ Job Tracker Name Node HDFS

Hive

Driver(compiles, optimizes, executes)

CLI HWI Thrift Server

Metastore

JDBC ODBC

Karmasphere Others...

Tuesday, January 10, 12

Hive queries generate MR jobs. (Some operations don’t invoke Hadoop processes, e.g., some very simple queries and commands that just write updates to the metastore.) CLI = Command Line Interface.HWI = Hive Web Interface.

Page 33: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• HDFS

• MapR

• S3

• HBase (new)

• Others...33

Tables

Hadoop

Master✲ Job Tracker Name Node HDFS

Hive

Driver(compiles, optimizes, executes)

CLI HWI Thrift Server

Metastore

JDBC ODBC

Karmasphere Others...

Tuesday, January 10, 12

There is “early” support for using Hive with HBase. Other databases and distributed file systems will no doubt follow.

Page 34: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Table metadata stored in a relational DB.

34

Tables

Hadoop

Master✲ Job Tracker Name Node HDFS

Hive

Driver(compiles, optimizes, executes)

CLI HWI Thrift Server

Metastore

JDBC ODBC

Karmasphere Others...

Tuesday, January 10, 12

For production, you need to set up a MySQL or PostgreSQL database for Hive’s metadata. Out of the box, Hive uses a Derby DB, but it can only be used by a single user and a single process at a time, so it’s fine for personal development only.

Page 35: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Most queries use MapReduce jobs.

35

Queries

Hadoop

Master✲ Job Tracker Name Node HDFS

Hive

Driver(compiles, optimizes, executes)

CLI HWI Thrift Server

Metastore

JDBC ODBC

Karmasphere Others...

Tuesday, January 10, 12

Hive generates MapReduce jobs to implement all the but the simplest queries.

Page 36: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Benefits

• Horizontal scalability.

• Drawbacks

• Latency!

36

MapReduce Queries

Tuesday, January 10, 12

The high latency makes Hive unsuitable for “online” database use. (Hive also doesn’t support transactions and has other limitations that are relevant here…) So, these limitations make Hive best for offline (batch mode) use, such as data warehouse apps.

Page 37: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Benefits

• Horizontal scalability.

• Data redundancy.

• Drawbacks

• No insert, update, and delete!

37

HDFS Storage

Tuesday, January 10, 12

You can generate new tables or write to local files. Forthcoming versions of HDFS will support appending data.

Page 38: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Schema on Read

• Schema enforcement at query time, not write time.

38

HDFS Storage

Tuesday, January 10, 12

Especially for external tables, but even for internal ones since the files are HDFS files, Hive can’t enforce that records written to table files have the specified schema, so it does these checks at query time.

Page 39: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• No Transactions.

• Some SQL features not implemented (yet).

39

Other Limitations

Tuesday, January 10, 12

Page 40: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved40

More on Tables and Schemas

Tuesday, January 10, 12

Page 41: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved41

Data Types

• The usual scalar types:

•TINYINT, …, BIGNT.

•FLOAT, DOUBLE.

•BOOLEAN.

•STRING.

Tuesday, January 10, 12

Like most databases...

Page 42: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved42

Data Types

• The unusual complex types:

•STRUCT.

•MAP.

•ARRAY.

Tuesday, January 10, 12

Structs are like “objects” or “c-style structs”. Maps are key-value pairs, and you know what arrays are ;)

Page 43: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved43

CREATE TABLE employees ( name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING,FLOAT>, address STRUCT< street:STRING, city:STRING, state:STRING, zip:INT>);

Tuesday, January 10, 12

subordinates references other records by the employee name. (Hive doesn’t have indexes, in the usual sense, but an indexing feature was recently added.) Deductions is a key-value list of the name of the deduction and a float indicating the amount (e.g., %). Address is like a “class”, “object”, or “c-style struct”, whatever you prefer.

Page 44: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved44

File & Record Formats

CREATE TABLE employees (…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

All the defaults for text files!

Tuesday, January 10, 12

Suppose our employees table has a custom format and field delimiters. We can change them, although here I’m showing all the default values used by Hive!

Page 45: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved45

Select, Where,

Group By, Join,...

Tuesday, January 10, 12

Page 46: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved46

Common SQL...

• You get most of the usual suspects for SELECT, WHERE, GROUP BY and JOIN.

Tuesday, January 10, 12

We’ll just highlight a few unique features.

Page 47: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved47

ADD JAR MyUDFs.jar;

CREATE TEMPORARY FUNCTION net_salary AS 'com.example.NetCalcUDF';

SELECT name, net_salary(salary, deductions) FROM employees;

“User Defined Functions”

Tuesday, January 10, 12

Following a Hive defined API, implement your own functions, build, put in a jar, and then use them in your queries. Here we (pretend to) implement a function that takes the employee’s salary and deductions, then computes the net salary.

Page 48: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved48

SELECT name, salary FROM employees ORDER BY salary ASC;

ORDER BY vs. SORT BY

• A total ordering - one reducer.

SELECT name, salary FROM employees SORT BY salary ASC;

• A local ordering - sorts within each reducer.

Tuesday, January 10, 12

For a giant data set, piping everything through one reducer might take a very long time. A compromise is to sort “locally”, so each reducer sorts it’s output. However, if you structure your jobs right, you might achieve a total order depending on how data gets to the reducers. (E.g., each reducer handles a year’s worth of data, so joining the files together would be totally sorted.)

Page 49: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Only equality (x = y).

49

Inner Joins

SELECT ... FROM clicks a JOIN clicks b ON ( a.uid = b.uid, a.day = b.day) WHERE a.process = 'movies' AND b.process = 'books' AND a.year > 2012;

Tuesday, January 10, 12

Note that the a.year > ‘…’ is in the WHERE clause, not the ON clause for the JOIN. (I’m doing a correlation query; which users searched for movies and books on the same day?) Some outer and semi join constructs supported, as well as some Hadoop-specific optimization constructs.

Page 50: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved50

A Final Example of Controlling MapReduce...

Tuesday, January 10, 12

Page 51: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Calling out to external programs to perform map and reduce operations.

51

Specify Map & Reduce Processes

Tuesday, January 10, 12

Page 52: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved52

ExampleFROM ( FROM clicks MAP message USING '/tmp/vampire_extractor' AS item_title, count CLUSTER BY item_title) it INSERT OVERWRITE TABLE vampire_stuff REDUCE it.item_title, it.count USING '/tmp/thing_counter.py' AS item_title, counts;

Tuesday, January 10, 12

Note the MAP … USING and REDUCE … USING. We’re also using CLUSTER BY (distributing and sorting on “item_title”).

Page 53: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved53

ExampleFROM ( FROM clicks MAP message USING '/tmp/vampire_extractor' AS item_title, count CLUSTER BY item_title) it INSERT OVERWRITE TABLE vampire_stuff REDUCE it.item_title, it.count USING '/tmp/thing_counter.py' AS item_title, counts;

Call specific map and reduce

processes.

Tuesday, January 10, 12

Note the MAP … USING and REDUCE … USING. We’re also using CLUSTER BY (distributing and sorting on “item_title”).

Page 54: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved54

… And Also:FROM ( FROM clicks MAP message USING '/tmp/vampire_extractor' AS item_title, count CLUSTER BY item_title) it INSERT OVERWRITE TABLE vampire_stuff REDUCE it.item_title, it.count USING '/tmp/thing_counter.py' AS item_title, counts;

Like GROUP BY, but directs

output to specific

reducers.

Tuesday, January 10, 12

Note the MAP … USING and REDUCE … USING. We’re also using CLUSTER BY (distributing and sorting on “item_title”).

Page 55: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved55

… And Also:FROM ( FROM clicks MAP message USING '/tmp/vampire_extractor' AS item_title, count CLUSTER BY item_title) it INSERT OVERWRITE TABLE vampire_stuff REDUCE it.item_title, it.count USING '/tmp/thing_counter.py' AS item_title, counts;

How to populate an “internal”

table.

Tuesday, January 10, 12

Note the MAP … USING and REDUCE … USING. We’re also using CLUSTER BY (distributing and sorting on “item_title”).

Page 56: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved56

Hive: Conclusions

Tuesday, January 10, 12

Page 57: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Not a real SQL Database.

• Transactions, updates, etc.

• … but features will grow.

• High latency queries.

• Documentation poor.57

Hive Disadvantages

Tuesday, January 10, 12

Page 58: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved

• Indispensable for SQL users.

• Easier than Java MR API.

• Makes porting data warehouse apps to Hadoop much easier.

58

Hive Advantages

Tuesday, January 10, 12

Page 59: Hive sq lfor-hadoop

Copyright  ©  2011-­‐2012,  Think  Big  Analy8cs,  All  Rights  Reserved59

Questions?

[email protected]

thinkbiganalytics.com

This presentation: github.com/deanwampler/Presentations

Tuesday, January 10, 12