Execution Environments for Distributed Computing
Apache Hive
EEDC 34330
Master in Computer Architecture, Networks and Systems - CANS
Homework number: 3
Group number: EEDC-1
Group members:
Hugo Pérez – [email protected]
Sergio Mendoza – [email protected]
Fenoy – [email protected]
Outline
● Introduction
● Hive Database
  ○ Data Model
  ○ Query Language
● Hive Architecture
● Conclusions
Introduction
Origins at Facebook...
● Facebook has 500,000,000 logs per day
● Facebook shares a billion pieces of content daily
● Facebook stores a vast amount of data
Introduction
What's the problem?
● 250 million photos per day
● 2.7 billion likes and comments per day
● 2 billion total registered users
● 100 billion friendships
● ...
TOO MUCH DATA!!
Introduction
What is Apache Hive?
● Hive is a data warehouse infrastructure
And what is a data warehouse (DW)?
● A DW is a database used for reporting and analysis
Introduction
How does Apache Hive work?
● Hive is built on top of Hadoop
● Hive stores data in the HDFS
● Hive compiles SQL queries into MapReduce jobs and runs those jobs on the cluster
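To make the compilation idea concrete, here is a minimal in-memory sketch of how a query such as SELECT school, COUNT(1) FROM profiles GROUP BY school could be expressed as a map step, a shuffle, and a reduce step. It is not Hive's generated code; the PROFILES rows and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a `profiles` table: (userid, school).
PROFILES = [
    (1, "UPC"),
    (2, "UPC"),
    (3, "MIT"),
]


def map_phase(rows):
    """Map: emit (group-by key, 1) for every input row."""
    for _userid, school in rows:
        yield school, 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key; COUNT(1) becomes a sum of ones."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    """Chain the three phases for the illustrative GROUP BY query."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling run_group_by_count(PROFILES) returns one count per school. On a real cluster the map and reduce steps would run as distributed tasks over data stored in HDFS.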
Introduction
How does Apache Hive work?
HiveQL query
Introduction
How does a simple web app work?
MySQL query
Data Model
Hive structures data into well-understood database concepts such as tables, columns, and rows.
HiveQL
Hive defines a simple SQL-like query language called HiveQL.
- Supports DDL and DML statements.
- Users can embed custom map-reduce scripts in queries.
- Supports user-defined functions (UDF), aggregate functions (UDAF), and table functions (UDTF).
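The three kinds of user-defined function differ in how many rows they consume and produce. The sketch below shows only that difference in shape; it is written in Python for brevity, whereas real Hive functions are implemented against Hive's Java interfaces. The function names are assumptions chosen to echo common built-ins.

```python
def udf_upper(value):
    """UDF: one input value in, one output value out (applied row by row)."""
    return value.upper()


def udaf_count(values):
    """UDAF: many input rows in, a single aggregated value out."""
    return sum(1 for _ in values)


def udtf_explode(values):
    """UDTF: one input row (here holding a list) in, zero or more output rows out."""
    for value in values:
        yield (value,)
```

For instance, udf_upper is applied once per row, udaf_count collapses a whole group into one number, and udtf_explode turns a single row containing a list into several rows.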
HiveQL Extract
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
      USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2;
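The slides do not show the contents of 'top10.py'. As a rough sketch of what such a reduce script might do, the function below takes tab-separated school, meme, cnt lines (the format Hive uses to stream rows to an external script) and keeps the n highest counts for each school. The function name, the n parameter, and the per-school grouping are assumptions, not the authors' script.

```python
import heapq
from collections import defaultdict


def top_n_per_school(lines, n=10):
    """Yield 'school<TAB>meme<TAB>cnt' lines with the n largest counts per school."""
    per_school = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        school, meme, cnt = line.split("\t")
        per_school[school].append((int(cnt), meme))
    for school in sorted(per_school):
        # nlargest returns the entries ordered by count, highest first.
        for cnt, meme in heapq.nlargest(n, per_school[school]):
            yield f"{school}\t{meme}\t{cnt}"
```

When used as an actual streaming script, the function would be fed sys.stdin and each yielded line would be printed to standard output.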
Architecture
● External Interfaces - Hive provides user interfaces, such as the command line (CLI) and a web UI, and application programming interfaces (APIs), such as JDBC and ODBC
● Thrift Server exposes a very simple client API to execute HiveQL statements
● Metastore is the system catalog. All other components of Hive interact with the metastore.
Architecture
● Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution
● Compiler translates statements into a plan which consists of a DAG of map-reduce jobs
● The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in a topological order
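Submitting jobs in topological order means a job is only started after every job it depends on. A minimal sketch using Python's standard graphlib module (Python 3.9 or later) is shown here; the job names and the PLAN dictionary are hypothetical and only loosely modelled on the meme query.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
PLAN = {
    "join_status_profiles": set(),
    "count_memes": {"join_status_profiles"},
    "top10_per_school": {"count_memes"},
}


def submission_order(plan):
    """Return the jobs in an order where every dependency comes before its dependents."""
    return list(TopologicalSorter(plan).static_order())
```

A driver built this way would iterate over submission_order(PLAN) and hand each job to the execution engine in turn. When the graph has independent branches, more than one valid order exists.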
Metastore
The metastore is the system catalog which contains metadata about the tables stored in Hive.
● Database - a namespace for tables.
● Table - the metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information.
● Partition - each partition can have its own columns, SerDe and storage information.
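The three metastore objects above can be pictured as nested records. The dataclasses below are only an illustration of that structure; the class and field names are assumptions and do not reproduce the real metastore schema.

```python
from dataclasses import dataclass, field


@dataclass
class StorageInfo:
    location: str   # where the data files live (hypothetical path)
    serde: str      # name of the serializer/deserializer (hypothetical)


@dataclass
class Partition:
    values: dict            # partition key -> value, e.g. {"ds": "2011-04-01"}
    columns: list           # (name, type) pairs; may differ from the table's
    storage: StorageInfo    # a partition can carry its own storage and SerDe


@dataclass
class Table:
    name: str
    owner: str
    columns: list           # (name, type) pairs
    storage: StorageInfo
    partitions: list = field(default_factory=list)


@dataclass
class Database:
    name: str                                    # namespace for tables
    tables: dict = field(default_factory=dict)   # table name -> Table
```

For example, a Database named "default" could hold a Table "status_updates" whose Partition for one day points at its own location.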
Query Compiler
● Parser transforms a query string to a parse tree representation.
● Semantic Analyzer transforms the parse tree to a block-based internal query representation.
● Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators
● Optimizer performs multiple passes over the logical plan and rewrites it in several ways
● Physical Plan Generator converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs
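As one example of an optimizer rewrite over a tree of logical operators, the toy pass below pushes a filter underneath a projection so that rows are discarded earlier. The Op class and its "scan", "filter", and "select" kinds are invented for this sketch and are not Hive's operator classes. It also assumes the filter only refers to columns that exist below the projection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    kind: str                      # "scan", "filter" or "select" (hypothetical kinds)
    arg: str                       # table name, predicate, or column list
    child: Optional["Op"] = None   # single input operator, None for a scan


def push_filter_down(op):
    """Rewrite Filter(Select(x)) into Select(Filter(x)), recursively."""
    if op is None:
        return None
    if op.kind == "filter" and op.child is not None and op.child.kind == "select":
        select = op.child
        pushed = Op("filter", op.arg, select.child)
        # Keep pushing in case another select sits directly below.
        return Op("select", select.arg, push_filter_down(pushed))
    return Op(op.kind, op.arg, push_filter_down(op.child))
```

A real optimizer would run several such passes and would check column dependencies before moving an operator.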
Conclusions
● Hive provides a solution for running business intelligence over very large data sets on top of the mature Hadoop map-reduce platform.
● The SQL-like HiveQL shortens the learning curve compared with writing low-level map-reduce programs.
Questions?
Links:
http://i.stanford.edu/~ragho/hive-icde2010.pdf
http://www.vldb.org/pvldb/2/vldb09-938.pdf
http://hive.apache.org/
https://cwiki.apache.org/Hive/languagemanual-transform.html
http://biggdata.blogspot.com/2011/04/refreshing-trendingtopics-website-data.html
http://code.google.com/p/hive-mrc/wiki/AboutHiveCore