Hive - Data Warehousing & Analytics on Hadoop
Wednesday, June 10, 2009, Santa Clara Marriott
Namit Jain, Zheng Shao, Facebook


Page 1: Hadoop Summit 2009 Hive

Hive - Data Warehousing & Analytics on Hadoop

Wednesday, June 10, 2009 Santa Clara Marriott

Namit Jain, Zheng Shao, Facebook

Page 2: Hadoop Summit 2009 Hive

Agenda

» Introduction

» Facebook Usage

» Hive Progress and Roadmap

» Open Source Community

Facebook

Page 3: Hadoop Summit 2009 Hive

» Introduction


Page 4: Hadoop Summit 2009 Hive

Why Another Data Warehousing System?

» Data, data and more data
  › ~1TB per day in March 2008
  › ~10TB per day today


Page 5: Hadoop Summit 2009 Hive
Page 6: Hadoop Summit 2009 Hive

Let's try Hadoop…

» Pros
  › Superior availability/scalability/manageability
  › Efficiency not that great, but can throw more hardware at it
  › Partial availability/resilience/scale more important than ACID

» Cons: Programmability and Metadata
  › Map-reduce hard to program (users know SQL/bash/python)
  › Need to publish data in well-known schemas

» Solution: HIVE


Page 7: Hadoop Summit 2009 Hive

Let's try Hadoop… (continued)

RDBMS> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'

$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'

$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1

$ bin/hadoop dfs -cat /tmp/largekey/part*


Page 8: Hadoop Summit 2009 Hive

What is HIVE?

» A system for managing and querying structured data built on top of Hadoop
  › Map-Reduce for execution
  › HDFS for storage
  › Metadata on raw files

» Key Building Principles:
  › SQL as a familiar data warehousing tool
  › Extensibility: Types, Functions, Formats, Scripts
  › Scalability and Performance


Page 9: Hadoop Summit 2009 Hive

Simplifying Hadoop

RDBMS> select key, count(1) from kv1 where key > 100 group by key;

vs.

hive> select key, count(1) from kv1 where key > 100 group by key;


Page 10: Hadoop Summit 2009 Hive

» Facebook Usage


Page 11: Hadoop Summit 2009 Hive

Data Warehousing at Facebook Today


[Diagram] Web Servers → Scribe Servers → Filers → Hive on Hadoop Cluster, alongside Oracle RAC and Federated MySQL

Page 12: Hadoop Summit 2009 Hive

Hive/Hadoop Usage @ Facebook

» Types of Applications:
  › Reporting
    • Eg: Daily/Weekly aggregations of impression/click counts
    • SELECT pageid, count(1) AS imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;

• Complex measures of user engagement

› Ad hoc Analysis

• Eg: how many group admins broken down by state/country

› Data Mining (Assembling training data)

• Eg: User Engagement as a function of user attributes

› Spam Detection

• Anomalous patterns for Site Integrity

• Application API usage patterns

› Ad Optimization


Page 13: Hadoop Summit 2009 Hive

Hadoop Usage @ Facebook

» Cluster Capacity:
  › 600 nodes
  › ~2.4PB (80% used)

» Data statistics:
  › Source logs/day: 6TB
  › Dimension data/day: 4TB
  › Compression factor: ~5x (gzip)

» Usage statistics:
  › 3200 jobs/day with 800K map-reduce tasks/day
  › 55TB of compressed data scanned daily
  › 15TB of compressed output data written to HDFS
  › 150 active users within Facebook


Page 14: Hadoop Summit 2009 Hive

» Hive Progress and Roadmap


Page 15: Hadoop Summit 2009 Hive

» CREATE TABLE clicks (key STRING, value STRING)
  PARTITIONED BY (ds STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
  WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
  LOCATION '/hive/clicks';


Page 16: Hadoop Summit 2009 Hive

Data Model

[Diagram] Tables → Logical Partitioning → Hash Partitioning

HDFS layout for the clicks table:
  /hive/clicks                    table
  /hive/clicks/ds=2008-03-25     logical partition
  /hive/clicks/ds=2008-03-25/0   hash bucket

MetaStore DB holds per-table metadata: Data Location, Partitioning Cols, Bucketing Info
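A minimal Python sketch of this layout, assuming a CRC32 stand-in for Hive's bucketing hash (the function name and hash choice are illustrative, not Hive internals):

```python
import zlib

def partition_path(table_location, partition_cols, row, bucket_col, num_buckets):
    # Logical partitioning: one subdirectory per partition-column value,
    # e.g. /hive/clicks/ds=2008-03-25.
    parts = "/".join("%s=%s" % (c, row[c]) for c in partition_cols)
    # Hash partitioning (bucketing): a deterministic hash of the bucket
    # column, mod the bucket count, picks the file within the partition.
    bucket = zlib.crc32(str(row[bucket_col]).encode()) % num_buckets
    return "%s/%s/%d" % (table_location, parts, bucket)
```

For a clicks row with ds='2008-03-25', this produces a path under /hive/clicks/ds=2008-03-25/ ending in a bucket number below num_buckets.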

Page 17: Hadoop Summit 2009 Hive

HIVE: Components

[Diagram] Hive architecture:
  Interfaces: Hive CLI, Web UI, Thrift API (DDL, Queries, Browsing)
  Driver: Parser → Planner → Optimizer → Execution
  SerDe: Thrift, CSV, JSON, ...
  MetaStore (backed by a DB)
  Storage and compute: HDFS + Map Reduce

Page 18: Hadoop Summit 2009 Hive

Hive Query Language

» SQL
  › Subqueries in FROM clause
  › Equi-joins
  › Multi-table Insert
  › Multi-group-by

» Sampling
  › SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key
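The sampling clause reads only one of the table's 32 buckets. A toy Python sketch of the idea, using a CRC32 stand-in (an assumption, not Hive's actual bucketing hash):

```python
import zlib

def bucket_of(key, num_buckets):
    # Deterministic stand-in for Hive's bucketing hash (illustrative only).
    return zlib.crc32(str(key).encode()) % num_buckets

def tablesample(rows, bucket, out_of, key_col="key"):
    # TABLESAMPLE(BUCKET b OUT OF n): keep rows whose key hashes into
    # bucket b-1, i.e. roughly 1/n of the data.
    return [r for r in rows if bucket_of(r[key_col], out_of) == bucket - 1]
```

Because bucketing is deterministic, every row lands in exactly one bucket, so the 32 samples partition the table.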


Page 19: Hadoop Summit 2009 Hive

FROM pv_users

INSERT INTO TABLE pv_gender_sum
SELECT gender, count(DISTINCT userid)
GROUP BY gender

INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age

INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age;


Page 20: Hadoop Summit 2009 Hive

Hive Query Language (continued)

» Extensibility
  › Pluggable Map-reduce scripts
  › Pluggable User Defined Functions
  › Pluggable User Defined Types
    • Complex object types: List of Maps
  › Pluggable Data Formats
    • Apache Log Format


Page 21: Hadoop Summit 2009 Hive

Pluggable Map-Reduce Scripts

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
  REDUCE map.dt, map.uid
  USING 'reduce_script'
  AS date, count;
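'map_script' and 'reduce_script' are placeholders in the query above; a minimal Python pair with the same shape might look like this (tab-separated streaming I/O is an assumption for illustration):

```python
from itertools import groupby

def map_script(lines):
    # Stand-in for 'map_script': turn "userid\tdate" rows into (dt, uid) pairs.
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield (date, userid)

def reduce_script(pairs):
    # Stand-in for 'reduce_script': rows arrive clustered by dt;
    # count the uids seen for each dt.
    for dt, group in groupby(sorted(pairs), key=lambda p: p[0]):
        yield (dt, sum(1 for _ in group))
```

CLUSTER BY dt guarantees all rows for a given dt reach the same reducer, which is what lets the counting script work on a contiguous group.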

Page 22: Hadoop Summit 2009 Hive

Map Reduce Example

[Diagram] Input on two machines flows through Local Map, Global Shuffle, Local Sort, Local Reduce:

Local Map:
  Machine 1: <k1,v1> <k2,v2> <k3,v3>  →  <nk1,nv1> <nk2,nv2> <nk3,nv3>
  Machine 2: <k4,v4> <k5,v5> <k6,v6>  →  <nk2,nv4> <nk2,nv5> <nk1,nv6>

Global Shuffle:
  Reducer 1 gets: <nk2,nv4> <nk2,nv5> <nk2,nv2>
  Reducer 2 gets: <nk1,nv1> <nk3,nv3> <nk1,nv6>

Local Sort:
  Reducer 1: <nk2,nv4> <nk2,nv5> <nk2,nv2>
  Reducer 2: <nk1,nv1> <nk1,nv6> <nk3,nv3>

Local Reduce (count per key):
  Reducer 1: <nk2,3>
  Reducer 2: <nk1,2> <nk3,1>
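The four stages above can be simulated in a few lines of Python (a toy sketch, not Hadoop code; map_fn and reduce_fn are the user-supplied pieces):

```python
from collections import defaultdict

def run_mapreduce(machines, map_fn, reduce_fn):
    # Local map on each machine, then global shuffle groups values by new key.
    shuffled = defaultdict(list)
    for records in machines:
        for k, v in records:
            for nk, nv in map_fn(k, v):
                shuffled[nk].append(nv)
    # Local sort over keys, then local reduce per key.
    return {nk: reduce_fn(nk, shuffled[nk]) for nk in sorted(shuffled)}
```

With a map function that emits (new_key, 1) and a reduce that sums, this reproduces the slide's counts: nk2 → 3, nk1 → 2, nk3 → 1.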

Page 23: Hadoop Summit 2009 Hive

Hive QL – Join

INSERT INTO TABLE pv_users

SELECT pv.pageid, u.age

FROM page_view pv

JOIN user u

ON (pv.userid = u.userid);


Page 24: Hadoop Summit 2009 Hive

Hive QL – Join in Map Reduce

page_view:                       user:
  pageid userid time               userid age gender
  1      111    9:08:01            111    25  female
  2      111    9:08:13            222    32  male
  1      222    9:08:14

Map (tag rows by table of origin: 1 = page_view, 2 = user; key = userid):
  key  value        key  value
  111  <1,1>        111  <2,25>
  111  <1,2>        222  <2,32>
  222  <1,1>

Shuffle/Sort (group by userid):
  key  value        key  value
  111  <1,1>        222  <1,1>
  111  <1,2>        222  <2,32>
  111  <2,25>

Reduce (cross page_view and user values per userid):
  pageid age        pageid age
  1      25         1      32
  2      25
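The reduce-side join above can be sketched in Python (an illustrative simulation of the plan, not Hive's executor; the tag values 1 and 2 match the diagram):

```python
from collections import defaultdict

def reduce_side_join(page_view, user):
    # Map phase: tag each row with its table of origin, keyed by userid.
    shuffled = defaultdict(list)
    for pageid, userid, _time in page_view:
        shuffled[userid].append((1, pageid))   # tag 1 = page_view
    for userid, age, _gender in user:
        shuffled[userid].append((2, age))      # tag 2 = user
    # Reduce phase: per userid, cross page_view rows with user rows.
    result = []
    for userid, values in shuffled.items():
        pages = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        result += [(pageid, age) for pageid in pages for age in ages]
    return result
```

Run on the diagram's tables, it yields the same three (pageid, age) pairs as the Reduce stage.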

Page 25: Hadoop Summit 2009 Hive

Join Optimizations

» Map Joins
  › User-specified small tables stored in hash tables on the mapper, backed by JDBM
  › No reducer needed

INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);

» Future
  › Exploit table/column statistics for deciding strategy


Page 26: Hadoop Summit 2009 Hive

Hive QL – Map Join

[Diagram] The MAPJOIN'd table is loaded into an in-memory hash table keyed by userid; the other table streams through the mappers.

page_view:                       user:
  pageid userid time               userid age gender
  1      111    9:08:01            111    25  female
  2      111    9:08:13            222    32  male
  1      222    9:08:14

Hash table (keyed by userid):
  key  value
  111  <1,2>
  222  <2>

pv_users (result):
  pageid age
  1      25
  2      25
  1      32
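A minimal Python sketch of the map-join idea: load the small side into an in-memory dict, stream the big side through, and skip the shuffle and reduce phases entirely. (For simplicity this sketch hashes user and streams page_view; the slide's hint hashes page_view instead.)

```python
def map_join(page_view, user):
    # Small table becomes an in-memory hash table on each mapper.
    age_by_userid = {userid: age for userid, age, _gender in user}
    # Big table streams through; each row probes the hash table directly.
    return [(pageid, age_by_userid[userid])
            for pageid, userid, _time in page_view
            if userid in age_by_userid]
```

No reducer is needed because every join lookup completes inside the mapper.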

Page 27: Hadoop Summit 2009 Hive

Hive QL – Group By

SELECT pageid, age, count(1)

FROM pv_users

GROUP BY pageid, age;


Page 28: Hadoop Summit 2009 Hive

Hive QL – Group By in Map Reduce

pv_users (two map splits):
  pageid age        pageid age
  1      25         2      32
  1      25         1      25

Map (partial counts per <pageid,age> key):
  key     value     key     value
  <1,25>  2         <1,25>  1
                    <2,32>  1

Shuffle/Sort (group by key):
  key     value     key     value
  <1,25>  2         <2,32>  1
  <1,25>  1

Reduce:
  pageid age count     pageid age count
  1      25  3         2      32  1

Page 29: Hadoop Summit 2009 Hive

Group by Optimizations

» Map-side partial aggregations
  › Hash-based aggregates
  › Serialized key/values in hash tables
  › 90% speed improvement on Query
    • SELECT count(1) FROM t;

» Load balancing for data skew

» Optimizations being worked on:
  › Exploit pre-sorted data for distinct counts
  › Exploit table/column statistics for deciding strategy
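The map-side partial aggregation above can be sketched in Python (a toy illustration; flush_at is an assumed memory-bounding knob, not a real Hive parameter name):

```python
from collections import Counter

def map_side_partial_agg(split, flush_at=10000):
    # Hash-based aggregation inside the mapper: emit (key, partial count)
    # instead of one record per row, shrinking shuffle volume.
    counts = Counter()
    for pageid, age in split:
        counts[(pageid, age)] += 1
        if len(counts) >= flush_at:   # flush to bound mapper memory
            yield from counts.items()
            counts.clear()
    yield from counts.items()

def reduce_agg(partials):
    # The reducer sums partial counts to finish the GROUP BY.
    total = Counter()
    for key, n in partials:
        total[key] += n
    return dict(total)
```

On the previous slide's two splits, the mappers emit three partial counts instead of four raw rows, and the reducer combines them into the same final result.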


Page 30: Hadoop Summit 2009 Hive

Columnar Storage

» CREATE TABLE columnTable (key STRING, value STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe'
  STORED AS RCFILE;

» Saved 25% of space compared with SequenceFile
  › Based on one of the largest tables (30 columns) inside Facebook
  › Both are compressed with GzipCodec

» Speed improvements in progress
  › Need to propagate column-selection information to FileFormat

» *Contribution from Yongqiang He (outside Facebook)
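The columnar idea itself fits in a few lines of Python (a conceptual sketch of the RCFile-style layout, not its file format):

```python
def to_columnar(rows, columns):
    # Store values column-by-column: similar values sit together, which
    # compresses better than row-at-a-time storage.
    return {c: [r[c] for r in rows] for c in columns}

def read_column(columnar, column):
    # Column selection: a scan of one column never touches the others.
    return columnar[column]
```

This is why propagating column-selection information down to the FileFormat matters: without it, the reader still deserializes every column.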


Page 31: Hadoop Summit 2009 Hive

Speed Improvements over Time

Date       SVN Revision  Major Changes                Query A  Query B  Query C
2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec


» Query A: SELECT count(1) FROM t;
» Query B: SELECT concat(concat(concat(a,b),c),d) FROM t;
» Query C: SELECT * FROM t;

» Time measured is map-side time only (to avoid unstable shuffling time at reducer side). It includes time for decompression and compression (both using GzipCodec).

» * No performance benchmarks for Map-side Join yet.

Page 32: Hadoop Summit 2009 Hive

Overcoming Java Overhead

» Reuse objects
  › Use Writable instead of Java primitives
  › Reuse objects across all rows
  › *40% speed improvement on Query C

» Lazy deserialization
  › Only deserialize the column when asked
  › Very helpful for complex types (map/list/struct)
  › *108% speed improvement on Query A


Page 33: Hadoop Summit 2009 Hive

Generic UDF and UDAF

» Let UDF and UDAF accept complex-type parameters

» Integrate UDF and UDAF with Writables

// Reuse a single IntWritable across rows instead of allocating one per call
private final IntWritable intWritable = new IntWritable();

public IntWritable evaluate(IntWritable a, IntWritable b) {
  intWritable.set(a.get() + b.get());
  return intWritable;
}


Page 34: Hadoop Summit 2009 Hive

HQL Optimizations

» Predicate Pushdown

» Merging n-way join

» Column Pruning
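Predicate pushdown and column pruning can be sketched together in Python (a toy illustration of the optimizer's effect, not Hive's planner):

```python
def scan(rows, needed_cols, predicate):
    # Both optimizations applied at the scan, before any join or
    # aggregation sees the data:
    for row in rows:
        if predicate(row):                          # predicate pushdown
            yield {c: row[c] for c in needed_cols}  # column pruning
```

Filtering and projecting this early means later operators move less data, which is the whole point of pushing the work down.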


Page 35: Hadoop Summit 2009 Hive

» Open Source Community


Page 36: Hadoop Summit 2009 Hive

Open Source Community

» 21 contributors and growing
  › 6 contributors within Facebook

» Contributors from:
  › Academia
  › Other web companies
  › Etc.

» 7 committers
  › 1 external to Facebook, and looking to add more here


Page 37: Hadoop Summit 2009 Hive

» 50 JIRAs fixed in the last month

» 218 JIRAs still open

» 125 mails in the last month on hive-user@

» 600 mails in the last month on hive-dev@

» Various companies/universities
  › Adknowledge, AdMob
  › Berkeley, Chinese Academy of Sciences

» Demonstration at VLDB 2009


Page 38: Hadoop Summit 2009 Hive

Deployment Options

» EC2
  › http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely

» Cloudera Virtual Machine
  › http://www.cloudera.com/hadoop-training-hive-tutorial

» Your own cluster
  › http://wiki.apache.org/hadoop/Hive/GettingStarted

» Hive can directly consume data on Hadoop
  › CREATE EXTERNAL TABLE mytable (key STRING, value STRING) LOCATION '/user/abc/mytable';


Page 39: Hadoop Summit 2009 Hive

Future Work

» Benchmark & Performance

» Integration with BI tools (through JDBC/ODBC)

» Indexing

» More on Hive Roadmap› http://wiki.apache.org/hadoop/Hive/Roadmap

» Machine Learning Integration

» Real-time Streaming


Page 40: Hadoop Summit 2009 Hive

Information

» Available as a sub-project in Hadoop
  - http://wiki.apache.org/hadoop/Hive (wiki)
  - http://hadoop.apache.org/hive (home page)
  - http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
  - ##hive (IRC)
  - Works with hadoop-0.17, 0.18, 0.19

» Release 0.3 is out and more are coming

» Mailing Lists: › hive-{user,dev,commits}@hadoop.apache.org


Page 41: Hadoop Summit 2009 Hive

Contributors

» Aaron Newton

» Ashish Thusoo

» David Phillips

» Dhruba Borthakur

» Edward Capriolo

» Eric Hwang

» Hao Liu

» He Yongqiang

» Jeff Hammerbacher

» Johan Oskarsson

» Josh Ferguson

» Joydeep Sen Sarma

» Kim P.


» Michi Mutsuzaki

» Min Zhou

» Namit Jain

» Neil Conway

» Pete Wyckoff

» Prasad Chakka

» Raghotham Murthy

» Richard Lee

» Shyam Sundar Sarkar

» Suresh Antony

» Venky Iyer

» Zheng Shao

Page 42: Hadoop Summit 2009 Hive

» Questions
