Hive at Yahoo: Letters from the trenches | Presented by Mithun Radhakrishnan, Chris Drome | June 10, 2015, Hadoop Summit, San Jose, California


Page 1: Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches


Page 2

About myself

Mithun Radhakrishnan
Hive Engineer at Yahoo!
Hive Committer and long-time contributor
› Metastore-scaling
› Integration
› HCatalog

[email protected] @mithunrk

Page 3

About myself

Chris Drome
Hive Engineer at Yahoo!
Hive contributor
[email protected]

Page 4

Recap

Page 5

Page 6

Page 7

1 TB

› 6.2x speedup over Hive 0.10 (RCFile)
  • Between 2.5-17x
› Average query time: 172 seconds
  • Between 5-947 seconds
  • Down from 729 seconds (Hive 0.10 RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes

Page 8

Explaining the speed-ups

Hadoop 2.x, et al.

Apache Tez
› (Arbitrary DAG)-based execution engine
› "Playing the gaps" between M&R
  • Intermediate data and HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up

Hive
› Statistics
› Vectorized execution

ORC
› PPD (predicate pushdown)
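As a sketch, the ORC, vectorization, and Tez features above are typically switched on with settings like the following (table and column names are illustrative, not from the deck):

```sql
-- Store data as ORC to get columnar reads, dictionary encoding,
-- and predicate pushdown via stripe/row-group indexes.
CREATE TABLE page_views_orc (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Process rows in batches instead of one at a time.
set hive.vectorized.execution.enabled=true;
-- Push filter predicates down into the ORC reader.
set hive.optimize.index.filter=true;
-- Execute on Tez rather than MapReduce.
set hive.execution.engine=tez;
```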

Page 9

Expectations with Hive 0.13 in production

Tez would outperform M/R by miles
Tez would enable better cluster utilization
› Use fewer resources

Tez (and dependencies) would be "production ready"
› GUI for task logs, DAG overviews, swim-lanes
› Speculative execution

Similarly, ORC and Vectorization
› Support evolving schemas

Page 10

The Y!Grid

18 Hadoop clusters in the Y!Grid
› 41,565 nodes
› Biggest cluster: 5,728 nodes
› 1M jobs a day

Hadoop 2.6+
Large datasets
› Daily, hourly, minute-level frequencies
› Thousands of partitions, 100s of 1000s of files, TBs of data per partition
› 580 PB of data, total

Pig 0.14 on Tez, Pig 0.11
Hive 0.13 on Tez
HCatalog for interoperability
Oozie for scheduling
GDM for data-loading
Spark, HBase, Storm, etc.

Page 11

Data processing use cases

Grid usage
› 30+ million jobs per month
› 12+ million Oozie launcher jobs

Pig usage
› Handles the majority of data pipelines/ETL (~43% of jobs)

Hive usage
› Relatively smaller niche
› 632,000 queries per month (35% on Tez)

HCatalog for interoperability
› Metadata storage for all Hadoop data
› Yahoo-scale
› Pig pipelines with Hive analytics

Page 12

Business Intelligence Tools

Tableau, MicroStrategy
Power users
› Tableau Server for scheduled reports

Challenges:
› Security
  • ACLs, authentication, encryption over the wire
› Bandwidth
  • Transporting results over ODBC
  • Limit result-set to 1000s-10000s of rows
  • Aggregations
› Query latency
  • Metadata queries
  • Partition/table scans
  • Materialized views
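A common mitigation for the bandwidth and latency items above is to materialize a small aggregate table on the grid and point the BI tool at that, so only summary rows travel over ODBC. A hedged sketch, with hypothetical table and column names:

```sql
-- Hypothetical: pre-aggregate on the cluster so Tableau pulls
-- thousands of summary rows rather than scanning the raw table.
CREATE TABLE daily_summary STORED AS ORC AS
SELECT dt, property,
       COUNT(*)                AS events,
       COUNT(DISTINCT user_id) AS users
FROM raw_events
GROUP BY dt, property;
```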

Page 13

Non-negotiables for the Hive upgrade at Yahoo!

Data producer owns the data
› Unlike traditional DBs

Multi-paradigm data access/generation
› Pig/Hive/MapReduce using HCatalog

Highly available metadata service
UI for tracking/debugging jobs
Execution engine should ideally support speculative execution

Page 14

Yahoo! Hive-0.13

Based on Apache Hive-0.13.1
Internal Yahoo! patches (admin web-services, data discovery, etc.)
Community patches to stabilize Apache Hive-0.13.1

› Tez

• HIVE-7544, HIVE-6748, HIVE-7112, …

› Vectorization

• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …

› Failures

• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …

› Optimizations

• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …

› Data integrity

• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …

Phased upgrades
› Phase 1: 285 JIRAs
› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)
› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)

Page 15

Hive deployment (per cluster)

One remote Hive Metastore "instance"
› 4 HCatalog servers behind a hardware VIP
  • L3DSR load balancer
  • 96GB-128GB RAM, 16-core boxes
› Backed by Oracle RAC

About 10 gateways
› Interactive use of Hive (and Pig, Oozie, M/R)
› hive.metastore.uris -> HCatalog

About 4 HiveServer2 instances
› Ad hoc queries, aggregation

Page 16

Yahoo Confidential & Proprietary

Evolution of grid services at Yahoo!

[Architecture diagram: Browser, HUE, BI Tools → HiveServer2; Gateway machines → HCatalog servers → Oracle RAC; Grid]

Page 17

Challenges experienced with Hive on Tez

Query performance on very large data sets
› HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp

Split-generation on very large data sets
› Tends to generate more splits (map tasks) compared to M/R
› Long split-generation times
› Hogging the Hadoop queues
  • Wave factor vs multi-tenancy requirements
› HIVE-10114: Split strategies for ORC

Scaling problems with ATS
› More of a problem with Pig workflows
› 10K+ tasks/job are routine
› AM progress reporting, heart-beating, memory usage
› Hadoop 2.6.0.10+
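For context, HIVE-10114 made the ORC split strategy configurable. As we understand it, the choice trades start-up time against split quality; a sketch:

```sql
-- HYBRID (the default) picks a strategy per directory.
-- BI: generate splits from file sizes without reading ORC footers
--     (fast job start-up, coarser splits).
-- ETL: read footers to plan better splits for large scans.
set hive.exec.orc.split.strategy=BI;
```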

Page 18

Page 19

Fast execution engines aren't the whole picture

At Yahoo! scale:
› 100s of databases per cluster
› 100s of tables per database
› 100s of columns per table
› 1000s of partitions per table
  • Larger tables: thousands of partitions, per hour
  • Millions of partitions every few days
  • 10s of millions of partitions, over the dataset retention period

Problems:
› Metadata volume
  • Database/Table/Partition IO formats
  • Record serialization details
  • HDFS paths
  • Statistics
    – Per partition
    – Per column

Page 20

Letters from the trenches

Page 21

From: Another ETL pipeline.

To: The Yahoo Hive Team

Subject: Slow queries

YHive team,

My query fails with OutOfMemoryError. I tried increasing container size, but it still fails. Please help!

Here are my settings:

set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m";

...

INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )

SELECT * FROM (

...

) q

...
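A likely culprit in the letter above: increasing the container size does nothing while mapred.child.java.opts still caps the JVM heap at 1GB, and the 16MB split size multiplies the number of tiny tasks. A hedged sketch of the usual fix (exact values are illustrative):

```sql
-- Container size and JVM heap should move together; a common rule
-- of thumb is Xmx at roughly 80% of the container allocation.
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts=-Xmx3276m;

-- 16MB splits (16777216 bytes) spawn many short-lived tasks; a larger
-- max split size (e.g. 256MB) cuts per-task overhead.
set mapreduce.input.fileinputformat.split.maxsize=268435456;
```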

Page 22

From: YET another ETL pipeline.

To: The Yahoo Hive Team

Subject: Slow UDF performance

YHive team,

Why does using a simple custom UDF cause queries to time out?

SELECT foo, bar, my_function( goo )

FROM my_large_table

WHERE ...

Page 23

Page 24

From: The ETL team

To: The Yahoo Hive Team

Subject: A small matter of size...

Dear YHive team,

We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}.

For a given timestamp, the combined cardinality of the remaining partition-keys is about 10000/hr.

If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground?

Yours gigantically,

Project Grape Ape

Page 25

Page 26

Metadata volume and Query Execution time

Anatomy of a Hive query:
1. Compile the query to an AST
2. Thrift call to the Metastore, for the partition list
3. Examine partitions, data paths, etc.; construct the physical query plan
4. Run optimizers on the plan
5. Execute the plan (M/R, Tez)

Partition pruner:
› Removes partitions that shouldn't participate in the query
› In effect, removes input directories from the Hadoop job
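For illustration (hypothetical table), a query whose WHERE clause the pruner can evaluate at compile time, so only one partition's directory reaches the job:

```sql
-- dt is a partition key; the constant predicate lets the partition
-- pruner drop every other partition's input directory.
SELECT COUNT(*)
FROM page_views
WHERE dt = '20150610';
```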

Page 27

The problems of large-scale metadata

Partition pruner is single-threaded
› Query spans a day
› Query spanning a week? 2 million partitions

Partition objects are huge:
› HDFS paths
› IO formats
› Record deserializer info
› Data column schema

Datanucleus:
› 1 Partition: a join of 6 Oracle tables in the backend

Thrift serialization/deserialization takes minutes.
› *Minutes*.

Page 28

Immediate workarounds

“Hive wasn’t originally designed for more than 10000s of partitions, total…”

Throw hardware at it
› 4 HCatalog servers behind a hardware VIP
› High-RAM boxes:
  • 96GB-128GB metastore processes
  • Tune each to use 100 connections to the Oracle RAC

Client-side tuning
› Increase hive.metastore.client.socket.timeout
› Increase heap size as needed (container size)
› Multi-threaded fstat operations

Page 29

Fix the leaky/noisy bits

Metastore frequently ran out of memory:
› Disable the Hadoop FileSystem cache
  • HIVE-3098, HDFS-3545
  • FileSystem.CACHE used UGI.hashCode()
    – Compared Subjects for equality, not equivalence
› Fixed Thrift 0.9
  • TSaslServerTransport had circular references
  • The JVM couldn't detect these for cleanup
    – WeakReferences are your friend
  • Fix incompatibility with L3DSR pings

Data discovery from Oozie:
› Use JMS notifications, on publication
› Oozie coordinators wake up on ActiveMQ notification, kick off dependent workflows
› Reduced polling frequency

Page 30

More fixes

Metadata-only queries:
› SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
› Replace HiveMetaStoreClient::getPartitions() with getPartitionNames()
› Local job, versus cluster

Optimize the optimizer:
› The first step in some optimizers:
  • List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
  • Pray that the client and/or the metastore don't run out of memory
  • Take a nap
› Fixed PartitionPruner, MetadataOnlyOptimizer
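As far as we know, the metadata-only path is controlled by a Hive setting; a sketch of how a query that touches only a partition key can then be answered from the metastore:

```sql
-- Answer partition-key-only queries from metastore metadata,
-- without scanning any data files on the cluster.
set hive.optimize.metadataonly=true;

SELECT DISTINCT tstamp
FROM my_purple_table
ORDER BY tstamp DESC
LIMIT 1000;
```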

Page 31

Long-term fixes:

DirectSQL short-circuits:
› Datanucleus problems at scale
  • (Yes, we are aware of the irony that might result from extrapolation.)
› Specific to the backing DB

Compaction of partition info:
› HIVE-7223, HIVE-7576, HIVE-9845, etc.
› Schema evolves infrequently
› Partition info rarely differs from table info
  – Except HDFS paths (which are super-strings)
› List<Partition> vs Iterator<Partition>
  • PartitionSet abstraction
    – The delight of inheritance in Thrift
  • Reduced memory footprints

Page 32

“The finest trick of The Devil was to persuade you that he does not exist.”

-- ???

Page 33

Page 34

Page 35

Page 36

From: A major reporting team

To: The Yahoo Hive Team

Subject: Urgent! Customer reports are borking.

Dear YHive team,

When we connect Tableau Server 8.3 to Y!Hive 0.12/0.13, it is unusably slow. Queries take too long to run, and time out.

We’d prefer not to change our query-code too much. How soon can Hive accommodate our simple queries?

Yours hysterically,

Project Zodiac

Page 37

Analysis: The query

Non-constant partition-key predicates:
› E.g.

WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
  AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')

› Solution: Use constant expressions where possible.
› Fix: Hive 1.x supports dynamic partition pruning, and constant folding.

Costly joins with partitioned dimension tables:
› E.g.

SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table WHERE dt IN (SELECT MAX(dt) FROM dimension_table));

› Workaround: External "pointer" tables.
› Fix: Dynamic partition pruning.
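On Hive 0.13, the practical workaround for the non-constant predicate above was to substitute literal dates (computed by the client or scheduler, e.g. Oozie) so the partition pruner sees constants at compile time. An illustrative rewrite with hypothetical values:

```sql
-- Dates pre-computed outside Hive; the pruner can now evaluate the
-- predicate during planning and drop out-of-range partitions.
SELECT foo, bar
FROM my_table
WHERE utc_time <= '20150608'
  AND utc_time >= '20150509';
```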

Page 38

Analysis: The data

Data stored in TEXTFILE
› Solution: Switch to columnar storage
  • ORC, dictionary encoding, vectorization, predicate pushdown

Over-partitioning:
› Too many partition keys
› Diminishing returns with partition pruning
› Solution: Eliminate partition keys, consider sorting

Small part files
› Hard-coded nReducers
› E.g.

hive> dfs -count /projects/foo_stats;
9081  682735  1876847648672  /projects/foo.db/foo_stats

› Solution:
  • set hive.merge.mapfiles=true;
  • set hive.merge.mapredfiles=true;
  • set hive.merge.tezfiles=true;
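Alongside the merge flags above, two thresholds govern when the extra merge stage actually fires; a hedged sketch (values illustrative):

```sql
-- If the average output file falls below avgsize, Hive launches a
-- merge job that coalesces output into files of up to size.per.task.
set hive.merge.smallfiles.avgsize=134217728;  -- 128MB trigger
set hive.merge.size.per.task=268435456;       -- 256MB target file size
```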

Page 39

We’re not done yet

Tez/ATS scaling
Speed up split calculation
Auto/offline compaction
Abuse detection
Better handling of schema evolution
Skew joins in Hive
UDFs with JNI, and configuring LD_LIBRARY_PATH

Page 40

Questions?

Page 41

Backup

Page 42

YHive configuration settings:

set hive.merge.mapfiles=false; -- Except when producing data.

set hive.merge.mapredfiles=false; -- Except when producing data.

set hive.merge.tezfiles=false; -- Except when producing data.

-- For ORC files.

-- dfs.blocksize=134217728; -- hdfs-site.xml

set orc.stripe.size=67108864; -- 64MB stripes.

set orc.compress.size=262144; -- 256KB compress buffer.

set orc.compress=ZLIB; -- Override to NONE, per table.

set orc.create.index=true; -- ORC indexes.

set orc.optimize.index.filter=true; -- Predicate pushdown with ORC index

set orc.row.index.stride=10000;

Page 43

YHive configuration settings (contd.):

-- Delegation Token Store settings:

set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;

set hive.cluster.delegation.token.renew-interval=172800000;

(Start HCat Server with -Djute.maxbuffer=24MB -> 190K+ tokens.)

-- Data Nucleus settings:

set datanucleus.connectionPoolingType=DBCP; -- i.e., not BoneCP.

set datanucleus.cache.level1.type=none;

set datanucleus.cache.level2.type=none;

set datanucleus.connectionPool.maxWait=200000;

set datanucleus.connectionPool.minIdle=0;

-- Misc.

set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;

Page 44

Zookeeper Token Storage performance

Jute buffer size (MB)   Max delegation token count
4MB                     30K
8MB                     60K
12MB                    90K
16MB                    130K
20MB                    160K
24MB                    190K

Page 45

Why Hive on Tez?

Shark, Impala› Pre-emption for in-memory systems› Multi-tenant, shared clusters› Heterogeneous nodes› Existing ecosystem› Community-driven development

Shark› Good proof of concept, but was not production ready› Shuffle performance (at the time)› Hive on Spark – under active development

Page 46

Analysis: Tableau/ODBC driver

Tableau has come a long way, but
› Schema discovery
  • SELECT * FROM my_large_table LIMIT 0;
  • SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
  • Depends on vendor-specific driver name
› Schema metadata scans
  • 3 partition listings per query
› Miscellaneous problems:
  • "Custom SQL" rewrites
  • Trouble with quoting

tl;dr: Try to transition to Simba's 2.0.x drivers with Tableau 8.3.x

Page 47