
Impala benchmarks and tuning tips - HadoopCon 2014 in Taiwan


Impala Benchmarks and Tuning Tips

Simon Hsu (徐瑞興)

September 13, 2014


About Me

• 徐瑞興 (Simon Hsu)

– Started working with Hadoop during M.S. studies (2010)

• “A Transparent Approach to Run MapReduce Programs on Collaborative Hadoops” – IEEE BigData 2014

– FOXCONN – RD Dept.

• Hadoop Product Development

– Etu – RD Dept.

• Hadoop Solution (Etu/Cloudera) / Product Development


Outline

• Impala Performance Tuning Tips

– “Practical Performance Analysis and Tuning for Cloudera Impala” - Greg Rahn @ Hadoop World 2013

• Impala Benchmarks

– TPC-DS Kit for Impala


Brief History of Impala

http://mt.orz.at/archives/2012/12/hadoop.html


Hive & Impala

• Hive: runs queries as MapReduce jobs

• Impala: runs queries on an in-memory, distributed SQL query engine


Impala Features

• Fast

– Low-latency responses

• Bypasses the HDFS DataNode (reads directly from disk)

• Optimized for data warehouse queries (especially with Parquet)

• Easy to adopt

– Uses the same database metadata as Hive

• Also benefits tools such as Sqoop

– Supports common HDFS file formats

• Queries existing files on HDFS


No more predicting column lengths!


Impala Overview

http://www.slideshare.net/cloudera/impala-v1update130709222849phpapp01

[Figure: Impala overview diagram with callouts 1 to 5]


Impala Performance Tuning Tips

Pre-execution

• Data Types

• Partitioning

• File Format

• Compression

Query Execution

• Gather Table / Column Stats

• Join Type

• Query Profile

Overall Review

• Use Case

• Experience


http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html


Pre-execution

• Data Types

• Partitioning

• File Format

• Compression


Data Types

• Choose the appropriate data type for each column

– Avoids type-casting overhead

• Examples

– TIMESTAMP for time values

– INT for integers

Although STRING is powerful, prefer the specific type.
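The casting overhead can be illustrated with a small Python sketch. This is not Impala code; the column values and function names are invented for illustration. A filter over values stored as strings has to convert every value before comparing it, while values already stored as integers are compared directly.

```python
# Illustrative only: the same values stored as strings and as integers.
string_column = ["2451911", "2451920", "2451950"]   # hypothetical STRING column
int_column = [2451911, 2451920, 2451950]            # same data as an INT column


def count_in_range_strings(values, low, high):
    """Count matches when every value must first be cast from a string."""
    return sum(1 for v in values if low <= int(v) <= high)


def count_in_range_ints(values, low, high):
    """Count matches when values are already integers (no cast needed)."""
    return sum(1 for v in values if low <= v <= high)


if __name__ == "__main__":
    print(count_in_range_strings(string_column, 2451911, 2451941))
    print(count_in_range_ints(int_column, 2451911, 2451941))
```

Both functions return the same count; the string version simply does one extra conversion per row, which is the cost the slide suggests avoiding.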


Partition

• Partition tables to reduce disk I/O

– The partition key depends on the common query pattern

• e.g. partitioned by month

• e.g. partitioned by state
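How partitioning reduces disk I/O can be sketched in a few lines of Python. This is a toy model, not Impala internals: the rows, the month key, and the function names are all assumptions made for illustration. Rows are grouped by month, much like one HDFS directory per partition value, and a query on one month only reads that group.

```python
from collections import defaultdict

# Hypothetical rows: (month, state, amount). Not taken from the slides.
ROWS = [
    ("2014-01", "CA", 10),
    ("2014-01", "NY", 20),
    ("2014-02", "CA", 30),
    ("2014-03", "TX", 40),
]


def partition_by_month(rows):
    """Group rows by month, mimicking one directory per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return partitions


def scan_unpartitioned(rows, month):
    """Without partitions every row is read, then filtered."""
    rows_read = len(rows)
    matches = [r for r in rows if r[0] == month]
    return matches, rows_read


def scan_partitioned(partitions, month):
    """With partitions only the matching partition is read."""
    matches = partitions.get(month, [])
    return matches, len(matches)


if __name__ == "__main__":
    parts = partition_by_month(ROWS)
    print(scan_unpartitioned(ROWS, "2014-01"))
    print(scan_partitioned(parts, "2014-01"))
```

Both scans return the same rows, but the partitioned scan reads fewer of them. The benefit only appears when queries filter on the partition key, which is why the key should follow the common query pattern.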


Partition Files in HDFS

[Figure: HDFS listings of table files with partitions (directories) and without partitions (files)]


Query Test in partitions

[Figure: query test results with partition and without partition]


File Format

• Text

– Default Impala table format

• Parquet

– Optimized for working with large data files

• typically 1 GB per file

– Reorganizes data for maximum performance of data warehouse-style queries

• Column-oriented binary file format
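Why a column-oriented layout suits data warehouse-style queries can be sketched in Python. This is only an illustration of the layout idea, not of the Parquet format itself; the table contents and function names are invented. An aggregate over one column touches every field in a row layout, but only that column in a column layout.

```python
# Hypothetical three-column table; values are invented for illustration.
ROW_STORE = [
    {"item": "a", "quantity": 2, "price": 5.0},
    {"item": "b", "quantity": 1, "price": 7.5},
    {"item": "c", "quantity": 4, "price": 1.0},
]

# The same table laid out column by column.
COLUMN_STORE = {
    "item": ["a", "b", "c"],
    "quantity": [2, 1, 4],
    "price": [5.0, 7.5, 1.0],
}


def sum_quantity_row_store(rows):
    """Row layout: every field of every row is touched to reach one column."""
    values_touched = sum(len(row) for row in rows)
    total = sum(row["quantity"] for row in rows)
    return total, values_touched


def sum_quantity_column_store(columns):
    """Column layout: only the requested column is touched."""
    column = columns["quantity"]
    return sum(column), len(column)


if __name__ == "__main__":
    print(sum_quantity_row_store(ROW_STORE))
    print(sum_quantity_column_store(COLUMN_STORE))
```

The gap grows with the number of columns, so wide tables queried on a few columns benefit most.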


Compression

• Snappy

– Less CPU time

– Lower compression ratio

• Gzip

– More CPU time

– Higher compression ratio


• Test Table

– Number of records: 183,364,043

• Test Query

– [master.etu.im:21000] > SELECT COUNT(*) FROM store_sales;

• Setting the compression codec

– [master.etu.im:21000] > SET parquet.compression=[SNAPPY/GZIP/NONE/etc.]

Query time with different compression codecs

Codec     Table Size on HDFS (GB)    Query Time (s)
Snappy    9.2                        0.91
Gzip      6.8                        1.22
None      16.5                       1.21
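The tradeoff in these measurements is easier to read as ratios. The short Python sketch below only re-computes figures from the table above; the function names are made up for this example.

```python
# Figures copied from the table above: codec -> (size on HDFS in GB, query time in s).
RESULTS = {
    "Snappy": (9.2, 0.91),
    "Gzip": (6.8, 1.22),
    "None": (16.5, 1.21),
}


def compression_ratio(codec):
    """Uncompressed size divided by the size with the given codec."""
    return RESULTS["None"][0] / RESULTS[codec][0]


def speedup_vs_uncompressed(codec):
    """Uncompressed query time divided by the query time with the codec."""
    return RESULTS["None"][1] / RESULTS[codec][1]


if __name__ == "__main__":
    for codec in ("Snappy", "Gzip"):
        print(codec, round(compression_ratio(codec), 2),
              round(speedup_vs_uncompressed(codec), 2))
```

On these numbers Snappy shrinks the table by about 1.8x and gives the fastest query, while Gzip shrinks it by about 2.4x with a query time close to the uncompressed case.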


Compression Codecs Differ

http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/


Compression Codecs Differ (cont.)

http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html


Query Execution

• Gather Table / Column Stats

• Join Type

• Query Profile


Usage of the EXPLAIN Statement

• Query:

– [master.etu.im:21000] > explain select * from store_sales where ss_sold_date_sk between 2451911 and 2451941 limit 10;

• With partition: query time 0.31 s

• Without partition: query time 2.21 s


Compute Table Stats

• [master.etu.im:21000] > COMPUTE STATS customer;

• [master.etu.im:21000] > SHOW TABLE STATS customer;


Note to the audience: 2 files


Gather Column Stats

• [master.etu.im:21000] > SHOW COLUMN STATS tpcds_parquet.customer;


Join Type

• Two Types of Join

– Broadcast Join

• Default Join. Typically, broadcast joins are more efficient in cases where one table is much smaller than the other.

– Shuffle Join

• Typically, shuffle joins are more efficient for joins between large tables of similar size.

• Join Order Optimization

– If automatic optimization is not sufficient

• Consider adding STRAIGHT_JOIN immediately after SELECT

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_hints.html
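A rough way to see why relative table size decides between the two join types is to compare how much data each one moves across the network. The Python sketch below is a simplified cost model and an assumption for illustration, not Impala's actual planner: a broadcast join copies the smaller table to every node, while a shuffle join sends each table across the network once.

```python
def broadcast_cost(small_bytes, num_nodes):
    """Broadcast join: the smaller table is copied to every node."""
    return small_bytes * num_nodes


def shuffle_cost(small_bytes, large_bytes):
    """Shuffle join: both tables are hash-partitioned and sent once."""
    return small_bytes + large_bytes


def cheaper_join(size_a, size_b, num_nodes):
    """Pick the strategy that moves less data under this simple model."""
    small, large = sorted((size_a, size_b))
    if broadcast_cost(small, num_nodes) <= shuffle_cost(small, large):
        return "broadcast"
    return "shuffle"


if __name__ == "__main__":
    print(cheaper_join(1, 1000, 10))     # one table much smaller
    print(cheaper_join(900, 1000, 10))   # two large tables of similar size
```

Under this model a tiny table joined to a large one favors broadcast, and two large tables of similar size favor shuffle, which matches the guidance on the slide. Table and column statistics matter here because the planner needs size estimates to make this kind of choice.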


Query Profile

• impalad web console: http://Impalad_IP:25000/


Impala Performance Tuning Tips

Pre-execution

• Configurations Check

• Data Types

• Partitioning

• File Format

Query Execution

• Gather Table / Column Stats

• Join Type

• Query Profile

Overall Review

• Use Case

• Experience


Overall Review

• Use Case

• Experience


Use case

• Partitioning use case

– Lifetime value (LTV) analysis for online gaming

• Average days

• Average deposit

• Number of people in each interval

http://goo.gl/TPoqvk


Use case

• File format use case

– Improving query time at a hospital

• Query time reduced to 30%~50%

• Number of columns in each table: 40~50

• Number of records in the largest table: over 100,000,000

“Taking a rest helps you go further.”

http://goo.gl/RL6LSa


Notes in Configs

• HDFS replication bandwidth

– dfs.datanode.balance.bandwidthPerSec

• Default value : 10 MB/s

• Memory usage of the Impala daemon

– Impala Daemon Memory Limit

• e.g. mem_limit : 80%

• Enable HDFS Short Circuit Read

– dfs.client.read.shortcircuit = true


Notes during Operations

• Preserve the Parquet block size when copying files

– $ bin/hadoop distcp -pb srcPath dstPath

• CREATE EXTERNAL TABLE versus CREATE TABLE

– Determines whether the raw data is preserved when the table is dropped

• Be careful with INSERT INTO ... VALUES ...

– Generates many small files


Turn off Pretty Printing (-B)


Impala Benchmarks

• TPC Benchmark™ DS (TPC-DS)

– The New Decision Support Benchmark Standard

• Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data population, queries, data maintenance model and implementation rules have been designed to be broadly representative of modern decision support systems.

https://github.com/cloudera/impala-tpcds-kit


Procedure of TPC-DS Benchmark (Impala)

Preparation

• tpcds-env.sh

• hdfs-mkdirs.sh

Data Generation

• gen-dims.sh

• gen-facts.sh

Data Loading

• impala-create-external-tables.sh

• impala-load-dims.sh

• impala-load-store_sales.sh


Store Sales ER-Diagram

http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf

Fact Table


Query 7 – Intro.

• Compute the average quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event.

– Restrict the results to a specific gender, marital and educational status.


select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_cdemo_sk = cd_demo_sk
  and ss_promo_sk = p_promo_sk
  and cd_gender = 'F'
  and cd_marital_status = 'W'
  and cd_education_status = 'Primary'
  and (p_channel_email = 'N' or p_channel_event = 'N')
  and d_year = 1998
  and ss_sold_date_sk between 2450815 and 2451179
group by i_item_id
order by i_item_id
limit 100;

http://www.minddevelopmentanddesign.com/blog/leaving-las-vagues-or-focus-your-seo-keywords/


Conclusion

• Consider the Parquet table format

• Compression codec tradeoffs

• Disk I/O reduction by table partitioning

• Check query profiles for more information

• Run Impala Benchmarks and enjoy yourself

– TPC-DS (Decision Support Benchmark)


318, Rueiguang Rd., Taipei 114, Taiwan

Simon Hsu – Sr. Software [email protected]

Thank you