6
Apache Hive Tutorials HQL – Hive Query Language Prashant Kumar Pandey

Hive Tutorials - Hive Query Language

Embed Size (px)

DESCRIPTION

Learn Apache HIve Step By Step

Citation preview

Page 2: Hive Tutorials - Hive Query Language

Learn more with us at http://www.thriveschool.blogspot.in/

We have learned creating hive tables and loading hive data into those tables. Next step should

be executing hive queries on those tables. I am not going to explain basics of SQL as I assume

you are already familiar with SQL for relational databases. If you do not know SQL, step back

and learn SQL basics before you start leaning hive query language. I am assuming that you have

gone through all of my previous articles on this blog so I will not cover those concepts which I

have already explained in other articles. If you are directly landing at this page first time, I

recommend you to start reading from beginning to learn hive. I have followed a logical

sequence for hive learning in this blog.

Editing HQL on hive CLI is troublesome. So we will write our HQL in a file and execute script

from Hive CLI, in case of any errors or changes, we go back to our script and change it there and

execute it again.

If you have already executed HQL on CLI, you must have noticed that Hive CLI throws lot of

messages before printing query results. I feel little lost in those messages and want to disable it

and see only query output on the screen. To do this we have to start hive in silent mode which

you can do using hive –S to start hive CLI instead of using only hive command.

Let’s test it. I have a file named q.hql which I have stored in my home directory. It has a simple

select statement as shown in below screen.

Now, I will start hive in silent mode using hive –S and execute this file using hive CLI as shown in

below screen.

Ok, now you know how to start hive in silent mode and how to execute hql from a source file.

Hive query language is also known as HQL. It mostly resembles with SQL syntaxes of MYSQL

database but there are lots of differences. I need a suitable table, loaded with appropriate data

to demonstrate HQL. For this purpose, I created an employee 100+ employee records. I have

shared data file for your reference and download here.

First step is to create an external table and map this data. I will not load it further into managed

table and omit partitioning as our objective is to focus on HQL. You have already learned

creating a partitioned managed table and loading data into it from external table. If you want,

you can do it and execute all HQLs demonstrated here on your partitioned hive managed

tables. You should not expect and differences in final output.

Page 3: Hive Tutorials - Hive Query Language

Learn more with us at http://www.thriveschool.blogspot.in/

Command to create and external table is shown below.

You need to execute this command in hive CLI and copy data file in HDFS at location /user/pkp/hql

before you start executing hql statements. Our table employee is mostly simple except three columns

contacts, address and job_history.

Contacts column is a map which means it can store multiple key-value pairs like office phone, email,

mobile etc. we will see it in more detail when we execute hql statements on them.

Address column is a struct with three members as city, street and zip. We will see how we can

manipulate them in hql.

job_history is the most complex one. It’s an array which means it can contain multiple values starting

with array index 0. Array and maps are similar in hive as they both can store multiple values. Difference

is how they represent values. Array will use index starting from 0 whereas map will use key to identify

value. If you don’t get it now, don’t worry, we you will get it well with examples shown below.

To make this array more complex and to showcase that hive support nesting of complex data types, I

made it an array of struc. We will see how we can work with nested data types.

Let’s execute first and most simple hql statement, selecting only columns with primitive data type i.e.

string and int in our case.

And output is as below.

Page 4: Hive Tutorials - Hive Query Language

Learn more with us at http://www.thriveschool.blogspot.in/

Simple, no explanation required. Working with primitive data types is simple and most of your SQL skills

will work here. Complex data types like maps, structure and array are the one we will focus and learn

how to work with them.

Let’s select a map type column i.e. contacts for these two records an output is shown below.

You will notice following.

1. Hive returns them using JSON notation. You will see it same for other complex data types aslo.

2. There are three map-key pairs for each record. These are office, email and mobile.

3. This clarifies why I mentioned maps are similar to array because we can see that single column

contacts actually contains three values. Does it not appear like an array of three elements?

4. How do I access an individual element of a map?

Let’s do it.

Okay, I used array notation to access single element but instead of using index number, I used key. I

have used where clause in above example just to emphasize that your SQL skills will work mostly in

hql. I will not mention such things further and you should notice it yourself.

Let’s select a struct column.

Page 5: Hive Tutorials - Hive Query Language

Learn more with us at http://www.thriveschool.blogspot.in/

How do I access an individual element of a struct? Check below.

Let’s select an array column.

Did you notice that there are two elements in the array? An entire record is enclosed in [ ]. Array

elements are enclosed in { }. If you look at the output there are two sets of { } separated by comma,

each representing one array element. Let’s select only first element of the array in next screen.

Now you understand that there were two elements in the array and looking at only one element is

simple as using array notation with index value.

Each element of the array is again a struct of three members as you can see in the above screen.

Let’s select only one member of first element of the array.

I hope by now you learned how to access complex columns in hive and how to access if they are

nested. I will leave you at this stage to explore your SQL skills in HQL.

Page 6: Hive Tutorials - Hive Query Language

Learn more with us at http://www.thriveschool.blogspot.in/

Few notes before we conclude this post.

You can use where clause in HQL as you might be using in SQL. Most of the operators are same like =, !=,

<, IS NULL, IS NOT NULL LIKE etc. You can also use mathematical functions like round, floor, ceil, pow,

sqrt etc. You can use aggregate functions like count, sum, avg, min, max etc. HQL also supports having

clause along with aggregate functions. HQL also supports string functions like length, reverse, upper,

lower, substr, trim, concat etc. You can use CASE WHEN THEN clause also. You can use sub queries like

you use in SQL. You can use order by and union as well.

Joins are also supported with some limitations (only equijoins with AND, no OR is supported). Hive goes

left to right when joining multiple tables and WHERE clauses are evaluated after joins are performed.

In summary, most of you SQL skills will work here. Let me show another example of nested query syntax

supported in HQL which you may find different than some of the RDBMS.

There are lot of other things to cover in HQL which we will cover individually later. Keep learning.