
An Introduction to Hive: Components and Query Language

Jeff Hammerbacher, Chief Scientist and VP of Product
October 30, 2008


Hive Components: A Leaky Database

▪ Hadoop
  ▪ HDFS
  ▪ MapReduce (bundles Resource Manager and Job Scheduler)
▪ Hive
  ▪ Logical data partitioning
  ▪ Metastore (command line and web interfaces)
  ▪ Query Language
  ▪ Libraries to handle different serialization formats (SerDes)
  ▪ JDBC interface


Related Work: Glaringly Incomplete

▪ Gamma, Bubba, Volcano, etc.
▪ Google: Sawzall
▪ Yahoo: Pig
▪ IBM Research: JAQL
▪ Microsoft: SCOPE
▪ Greenplum: YAML MapReduce
▪ Aster Data: In-Database MapReduce
▪ Business.com: CloudBase


Hive Resources
▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
  ▪ Currently the best place to get the Hive distribution

▪ Wiki page: http://wiki.apache.org/hadoop/Hive
  ▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
  ▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
  ▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
  ▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap

▪Mailing list: [email protected]

▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455


Running Hive: Quickstart
▪ <install Hadoop>
▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
  ▪ (Replace 0.19 with 0.17 if you're still on 0.17)

▪ tar xvzf dist.tar.gz
▪ cd dist
▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
  ▪ Or: edit hadoop.bin.path and hadoop.config.dir in conf/hive-default.xml

▪ bin/hive
▪ hive>


Running Hive: Configuration Details
▪ conf/hive-default.xml
  ▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
  ▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
  ▪ hive.exec.scratchdir: HDFS directory where execution information is written
  ▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
  ▪ The rest of the properties relate to the Metastore

▪ conf/hive-log4j.properties
  ▪ Writes logs to /tmp/{user.name}/hive.log by default

▪ conf/jpox.properties
  ▪ JPOX is a Java object persistence library used by the Metastore
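▪ As an illustration, overriding one of these settings means editing its property element in conf/hive-default.xml; a minimal sketch, with the default warehouse path shown as the value:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>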


Populating Hive: MovieLens Data
▪ <cd into your hive directory>
▪ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
▪ tar xvzf ml-data.tar__0.gz
▪ CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
▪ The first query can take ten seconds or more, as the Metastore needs to be created

▪ To confirm our table has been created:
  ▪ SHOW TABLES;
  ▪ DESCRIBE u_data;

▪ LOAD DATA LOCAL INPATH 'ml-data/u.data'
  OVERWRITE INTO TABLE u_data;

▪ SELECT COUNT(1) FROM u_data;
  ▪ Should fire off 2 MapReduce jobs and ultimately return a count of 100,000


Hive Query Language: Utility Statements
▪ SHOW TABLES [table_name | table_name_pattern]

▪ DESCRIBE [EXTENDED] table_name
  [PARTITION (partition_col = partition_col_value, ...)]

▪ EXPLAIN [EXTENDED] query_statement

▪ SET [EXTENDED]

▪ "SET property_name=property_value" to modify a value
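▪ For example, a short interactive session exercising these statements (the property set here, mapred.reduce.tasks, is a standard Hadoop setting used purely for illustration):

  hive> SHOW TABLES;
  hive> DESCRIBE EXTENDED u_data;
  hive> EXPLAIN SELECT COUNT(1) FROM u_data;
  hive> SET mapred.reduce.tasks=4;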


Hive Query Language: CREATE TABLE Syntax
▪ CREATE [EXTERNAL] TABLE table_name (col_name data_type [col_comment], ...)
  [PARTITIONED BY (col_name data_type [col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

▪ PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
▪ CLUSTERED columns are real columns, hash partitioned into num_buckets folders
▪ ROW FORMAT can be used to specify a delimited data set or a custom deserializer
▪ Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
  ▪ "DROP TABLE table_name" can reverse this operation
  ▪ NB: Currently, DROP TABLE will delete both data and metadata
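▪ A sketch of how these clauses combine; the table, columns, and path are invented for illustration:

  CREATE EXTERNAL TABLE page_views (viewtime INT, userid BIGINT, page_url STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (userid) INTO 32 BUCKETS
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/user/data/page_views';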


Hive Query Language: CREATE TABLE Syntax, Part Two
▪ data_type: primitive_type | array_type | map_type
▪ primitive_type:
  ▪ TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
  ▪ DATE | DATETIME | TIMESTAMP

▪ array_type: ARRAY < primitive_type >
▪ map_type: MAP < primitive_type, primitive_type >
▪ row_format:
  ▪ DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

▪ SERIALIZER serde_name [WITH PROPERTIES property_name=property_value, property_name=property_value, ...]

▪ file_format: SEQUENCEFILE | TEXTFILE
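▪ A sketch combining the complex types and delimiter clauses above (table and column names are illustrative):

  CREATE TABLE user_profiles (userid INT, interests ARRAY<STRING>, properties MAP<STRING, STRING>)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
  STORED AS TEXTFILE;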


Hive Query Language: ALTER TABLE Syntax
▪ ALTER TABLE table_name RENAME TO new_table_name;
▪ ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
▪ ALTER TABLE table_name DROP partition_spec, partition_spec, ...;
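▪ Concretely, each form might look like the following; the names and partition spec are illustrative:

  ALTER TABLE u_data RENAME TO u_data_ratings;
  ALTER TABLE u_data_ratings ADD COLUMNS (source STRING);
  ALTER TABLE page_views DROP PARTITION (dt = '2008-10-01');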

▪ Future work:
  ▪ Support for removing or renaming columns
  ▪ Support for altering serialization format


Hive Query Language: LOAD DATA Syntax
▪ LOAD DATA [LOCAL] INPATH '/path/to/file'
  [OVERWRITE] INTO TABLE table_name
  [PARTITION (partition_col = partition_col_value, ...)]

▪ You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)

▪ If you don't specify OVERWRITE, data will be appended to the existing table
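▪ For instance, loading one day of data into a partitioned table (the path, table, and partition are illustrative):

  LOAD DATA LOCAL INPATH '/tmp/page_views-2008-10-01.txt'
  OVERWRITE INTO TABLE page_views
  PARTITION (dt = '2008-10-01');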


Hive Query Language: SELECT Syntax
▪ [insert_clause]
  SELECT [ALL|DISTINCT] select_list
  FROM [table_source|join_source]
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list]

▪ insert_clause: INSERT OVERWRITE destination

▪ destination:

  ▪ LOCAL DIRECTORY '/local/path'
  ▪ DIRECTORY '/hdfs/path'
  ▪ TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
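▪ Putting insert_clause and a destination together, a query that writes its result to an HDFS directory (the output path is illustrative):

  INSERT OVERWRITE DIRECTORY '/tmp/rating_histogram'
  SELECT a.rating, COUNT(1)
  FROM u_data a
  GROUP BY a.rating;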


Hive Query Language: SELECT Syntax (continued)
▪ join_source: table_source join_clause table_source join_clause table_source ...

▪ join_clause:
  ▪ [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)

▪ Currently, only outer equi-joins are supported in Hive.

▪ There are two join algorithms:
  ▪ Map-side merge join
  ▪ Reduce-side merge join
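▪ A sketch of an outer equi-join; u_users is a hypothetical table holding per-user attributes:

  SELECT a.userid, a.rating, b.age
  FROM u_data a LEFT OUTER JOIN u_users b ON (a.userid = b.userid);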


Hive Query Language: Building a Histogram of Review Counts
▪ CREATE TABLE review_counts (userid INT, review_count INT);
▪ INSERT OVERWRITE TABLE review_counts
  SELECT a.userid, COUNT(1) AS review_count
  FROM u_data a
  GROUP BY a.userid;

▪ SELECT b.review_count, COUNT(1)
  FROM review_counts b
  GROUP BY b.review_count;

▪ Notes:
  ▪ No INSERT OVERWRITE for the second query means output is dumped to the shell
  ▪ Hive does not currently support CREATE TABLE AS
    ▪ We have to create the table and then INSERT into it

  ▪ Hive does not currently support subqueries
    ▪ We have to write two queries


Hive Query Language: Running Custom MapReduce
▪ Put the following into weekday_mapper.py:

  import sys
  import datetime

  for line in sys.stdin:
      line = line.strip()
      userid, movieid, rating, unixtime = line.split('\t')
      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
      print ','.join([userid, movieid, rating, str(weekday)])

▪ CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

▪ FROM u_data a
  INSERT OVERWRITE TABLE u_data_new
  SELECT TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime)
  USING 'python /full/path/to/weekday_mapper.py'
  AS (userid, movieid, rating, weekday);


Hive Query Language: Programmatic Access

▪ The Hive shell can take a file with queries to be executed
  ▪ bin/hive -f /path/to/query/file

▪ You can also run a Hive query straight from the command line
  ▪ bin/hive -e 'quoted query string'

▪ A simple JDBC interface is available for experimentation as well
  ▪ https://issues.apache.org/jira/browse/HADOOP-4101
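▪ For example, a scripted report might use -e and capture the output; the query and output path are illustrative:

  bin/hive -e 'SELECT COUNT(1) FROM u_data' > /tmp/u_data_count.txt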


Hive Components: Metastore

▪ Currently uses an embedded Derby database for persistence
▪ While Derby is in place, you'll need to put it into Server Mode to have more than one concurrent Hive user
  ▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
▪ The next release will use MySQL as the default persistent data store
▪ The goal is to have the persistent store be pluggable
▪ You can view the Thrift IDL for the metastore online
  ▪ https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift
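▪ The Server Mode recipe above roughly amounts to pointing the JDO connection properties in conf/jpox.properties at a networked Derby instance; a sketch, with host, port, and database name as placeholders:

  javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/metastore_db;create=true
  javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver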


Hive Components: Query Processing
▪ Compiler
  ▪ Parser
  ▪ Type Checking
  ▪ Semantic Analysis
  ▪ Plan Generation
  ▪ Task Generation

▪ Execution Engine
  ▪ Plan
  ▪ Operators
  ▪ UDFs and UDAFs


Future Directions
▪ Query Optimization
  ▪ Support for Statistics
    ▪ These stats are needed to make optimization decisions

  ▪ Join Optimizations
    ▪ Map-side joins, semi-join techniques, etc. to make joins faster

  ▪ Predicate Pushdown Optimizations
    ▪ Push predicates down to just above the table scan in certain join situations, and ensure that only required columns are sent across map/reduce boundaries

  ▪ Group By Optimizations
    ▪ Various optimizations to make GROUP BY faster

  ▪ Optimizations to reduce the number of map output files created by filter operations
    ▪ Filters run with a large number of mappers produce many files, which slows down subsequent operations


Future Directions
▪ MapReduce Integration
  ▪ Schema-less MapReduce
    ▪ TRANSFORM needs a schema, while MapReduce is schema-less

  ▪ Improvements to TRANSFORM
    ▪ Make this more intuitive for MapReduce developers; evaluate other keywords, etc.

▪ User Experience
  ▪ Create a web interface
  ▪ Error reporting improvements for parse errors
  ▪ Add a "help" command to the CLI
  ▪ JDBC driver to enable traditional database tools to be used with Hive


Future Directions
▪ Integrating Dynamic SerDe with the DDL
  ▪ This allows users to create typed tables, with list and map types, from the DDL

▪ Transformations in LOAD DATA
  ▪ LOAD DATA currently does not transform the input data if it is not in the format expected by the destination table

▪ Explode and Collect Operators
  ▪ Explode and collect operators convert collections to individual items and vice versa

▪ Propagating sort properties to destination tables
  ▪ If a query produces sorted output, we want to capture that in the destination table's metadata so that downstream optimizations can be enabled


(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.