Impala benchmarks and tuning tips - HadoopCon 2014 in Taiwan

Impala Benchmarks and Tuning Tips

Simon Hsu (徐瑞興)

September 13, 2014

HadoopCon 2014

About Me

• 徐瑞興 (Simon Hsu)

– Started working with Hadoop during M.S. studies (2010)

• “A Transparent Approach to Run MapReduce Programs on Collaborative Hadoops” – IEEE BigData 2014

– FOXCONN – RD Dept.

• Hadoop Product Development

– Etu – RD Dept.

• Hadoop Solution (Etu/Cloudera) / Product Development


Outline

• Impala Performance Tuning Tips

– “Practical Performance Analysis and Tuning for Cloudera Impala” - Greg Rahn @ Hadoop World 2013

• Impala Benchmarks

– TPC-DS Kit for Impala


Brief History of Impala

http://mt.orz.at/archives/2012/12/hadoop.html



Hive & Impala

• Hive: runs queries as MapReduce jobs

• Impala: runs queries on an in-memory, distributed SQL query engine


Impala Features

• Fast

– Low-latency responses

• Bypasses the HDFS DataNode (reads directly from disk)

• Optimized for data warehouse queries (especially with Parquet)

• Easy to adopt

– Uses the same database metadata as Hive

• This also benefits tools such as Sqoop

– Supports common HDFS file formats

• Queries existing files on HDFS


No more predicting column lengths!


Impala Overview

http://www.slideshare.net/cloudera/impala-v1update130709222849phpapp01


Impala Performance Tuning Tips

Pre-execution

• Data Types

• Partitioning

• File Format

• Compression

Query Execution

• Gather Table / Column Stats

• Join Type

• Query Profile

Overall Review

• Use Case

• Experience


http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html



Pre-execution

• Data Types

• Partitioning

• File Format

• Compression


Data Types

• Use the appropriate data type for each column

– Avoids type-casting overhead

• Ex.

– TIMESTAMP for times

– INT for integers

Although STRING is powerful...


Partition

• Create table partitions to reduce disk IO

– Choose partition keys based on common query patterns

• Partitioned by Month

• Partitioned by State
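The way partitioning reduces disk I/O can be illustrated with a small sketch. This is not Impala code; it only models the idea that, for a table partitioned by month, a query filtering on the month needs to read just the matching partition directories. The table name `sales`, the directory layout, and the file names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical layout: one HDFS directory per partition value (month=YYYY-MM).
PARTITIONED_TABLE: Dict[str, List[str]] = {
    "/warehouse/sales/month=2014-07": ["part-0.parq", "part-1.parq"],
    "/warehouse/sales/month=2014-08": ["part-0.parq"],
    "/warehouse/sales/month=2014-09": ["part-0.parq", "part-1.parq"],
}


def partition_value(directory: str) -> str:
    """Extract the partition value from a 'key=value' directory name."""
    return directory.rsplit("/", 1)[-1].split("=", 1)[1]


def files_to_scan(table: Dict[str, List[str]], wanted_months: Set[str]) -> List[str]:
    """Return only the files that live in partitions matching the filter."""
    selected: List[str] = []
    for directory, files in sorted(table.items()):
        if partition_value(directory) in wanted_months:
            selected.extend(f"{directory}/{name}" for name in files)
    return selected


# No usable filter: every partition has to be read.
all_files = files_to_scan(PARTITIONED_TABLE, {"2014-07", "2014-08", "2014-09"})
# Filter on a single month: only one directory is touched.
pruned = files_to_scan(PARTITIONED_TABLE, {"2014-08"})

print(len(all_files), "files scanned without a partition filter")
print(len(pruned), "file(s) scanned when filtering on month 2014-08")
```

The benefit only appears when queries actually filter on the partition key, which is why the key should follow the common query pattern.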


Partition Files in HDFS

[Figure: HDFS listings. Table files with partitions appear as directories; table files without partitions appear as plain files.]


Query Test in partitions

[Figure: query results with partition and without partition]


File Format

• Text

– Default Impala table format

• Parquet

– Optimized for working with large data files

• typically 1 GB per file

– Reorganizes data for maximum performance on data-warehouse-style queries

• Column-oriented binary file format


Compression

• Snappy

– Less CPU time

– Lower compression ratio

• Gzip

– More CPU time

– Higher compression ratio


• Test Table

– Number of records: 183,364,043

• Test Query

– [master.etu.im:21000] > SELECT COUNT(*) FROM store_sales;

• Setting Compression codec

– [master.etu.im:21000] > SET parquet.compression=[SNAPPY/GZIP/NONE/etc.]

Query time with different compression codecs

Codec     Table Size on HDFS (GB)    Query Time (s)
Snappy    9.2                        0.91
Gzip      6.8                        1.22
None      16.5                       1.21
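To make the tradeoff in these measurements easier to read, the short sketch below derives the compression ratio and the relative query time of each codec against the uncompressed table. The numbers are copied from the results above; the function and variable names are only illustrative, and the figures come from a single COUNT(*) query, so they should not be read as a general ranking.

```python
from typing import Dict, Tuple

# Figures copied from the results above: codec -> (size on HDFS in GB, query time in s).
RESULTS: Dict[str, Tuple[float, float]] = {
    "Snappy": (9.2, 0.91),
    "Gzip": (6.8, 1.22),
    "None": (16.5, 1.21),
}


def compare_to_uncompressed(results: Dict[str, Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Express each codec's size and query time relative to the uncompressed table."""
    base_size, base_time = results["None"]
    summary = {}
    for codec, (size_gb, seconds) in results.items():
        summary[codec] = {
            "compression_ratio": base_size / size_gb,     # > 1 means smaller on disk
            "relative_query_time": seconds / base_time,   # < 1 means faster
        }
    return summary


summary = compare_to_uncompressed(RESULTS)
for codec, row in summary.items():
    print(f"{codec:7s} ratio {row['compression_ratio']:.2f}x, "
          f"query time {row['relative_query_time']:.2f}x of uncompressed")
```

In this one test, Snappy shrinks the table by roughly 1.8x and cuts query time by about a quarter, while Gzip shrinks it by roughly 2.4x with essentially the same query time as no compression.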


Compression Codec differs

http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/


Query Execution

• Gather Table / Column Stats

• Join Type

• Query Profile


Usage of the EXPLAIN Statement

• Query:

– [master.etu.im:21000] > explain select * from store_sales where ss_sold_date_sk between 2451911 and 2451941 limit 10;

• Query time

– with partition: 0.31 (s)

– without partition: 2.21 (s)


Compute Table Stats

• [master.etu.im:21000] > COMPUTE STATS customer;

• [master.etu.im:21000] > SHOW TABLE STATS customer;

Ladies and gentlemen: 2 files!


Gather Column Stats

• [master.etu.im:21000] > SHOW COLUMN STATS tpcds_parquet.customer;


Join Type

• Two Types of Join

– Broadcast Join

• Default Join. Typically, broadcast joins are more efficient in cases where one table is much smaller than the other.

– Shuffle Join

• Typically, shuffle joins are more efficient for joins between large tables of similar size.

• Join Order Optimization

– If automatic optimization is not sufficient

• consider adding STRAIGHT_JOIN immediately after SELECT

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_hints.html
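Why the relative table sizes matter for the join type can be shown with a deliberately simplified cost model. This is an assumption made for illustration, not Impala's actual planner: a broadcast join is modeled as sending a full copy of the smaller table to every node, and a shuffle join as redistributing both tables once across the network. All function names are hypothetical.

```python
def broadcast_cost(small_bytes: int, num_nodes: int) -> int:
    """Bytes sent when every node receives a full copy of the smaller table."""
    return small_bytes * num_nodes


def shuffle_cost(left_bytes: int, right_bytes: int) -> int:
    """Bytes sent when both tables are redistributed across nodes by join key."""
    return left_bytes + right_bytes


def cheaper_join(left_bytes: int, right_bytes: int, num_nodes: int) -> str:
    """Pick the join type with the lower modeled network cost."""
    small = min(left_bytes, right_bytes)
    if broadcast_cost(small, num_nodes) <= shuffle_cost(left_bytes, right_bytes):
        return "broadcast"
    return "shuffle"


GB = 1024 ** 3
MB = 1024 ** 2

# Large fact table joined with a small dimension table.
fact_dim = cheaper_join(100 * GB, 50 * MB, num_nodes=10)
# Two large tables of similar size.
big_big = cheaper_join(100 * GB, 80 * GB, num_nodes=10)

print("fact x dimension:", fact_dim)
print("large x large:   ", big_big)
```

Under this model the small-table case favors a broadcast join and the similar-size case favors a shuffle join, which matches the guidance above. Accurate table and column statistics are what allow a planner to make this comparison at all.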


Query Profile

• Impalad web console: http://impalad_ip:25000/



Pre-execution

• Configurations Check

• Data Types

• Partitioning

• File Format

Query Execution


• Join Type

• Query Profile

Overall Review

• Use Case

• Experience


Overall Review

• Use Case

• Experience


Use case

• Partitioning use case

– LTV (lifetime value) of online gaming

• Average days

• Average deposit

• Number of people in each interval

http://goo.gl/TPoqvk

Use case

• File format use case

– Improving query time at a hospital

• Reduced query time to 30%~50%

• Number of columns in each table: 40~50

• Number of records in the largest table: over 100,000,000

“Taking a rest helps you go further.”

http://goo.gl/RL6LSa

Notes in Configs

• HDFS replication bandwidth

– dfs.datanode.balance.bandwidthPerSec

• Default value: 10 MB/s

• Memory usage of the Impala daemon

– Impala Daemon Memory Limit

• (e.g.) mem_limit: 80%

• Enable HDFS short-circuit reads

– dfs.client.read.shortcircuit = true


Notes during Operations

• Preserve the Parquet block size when copying files

– $ bin/hadoop distcp -pb srcPath dstPath

• CREATE EXTERNAL TABLE vs. CREATE TABLE

– Determines whether the raw data is preserved when the table is dropped

• Be careful with INSERT INTO ... VALUES ...

– Generates many small files
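One way to limit the small-file problem, sketched below, is to group rows into fewer multi-row INSERT ... VALUES statements. This assumes that each INSERT statement writes at least one new data file, so fewer statements means fewer files. The table name `events`, the two integer columns, and the helper function are hypothetical; for real bulk loads, INSERT ... SELECT from a staging table is usually the better option.

```python
from typing import Iterable, List, Tuple


def batched_insert_statements(table: str,
                              rows: Iterable[Tuple[int, int]],
                              batch_size: int) -> List[str]:
    """Group integer rows into multi-row INSERT ... VALUES statements."""
    all_rows = list(rows)
    statements: List[str] = []
    for start in range(0, len(all_rows), batch_size):
        batch = all_rows[start:start + batch_size]
        values = ", ".join(f"({a}, {b})" for a, b in batch)
        statements.append(f"INSERT INTO {table} VALUES {values};")
    return statements


rows = [(i, i * 10) for i in range(6)]

# One statement per row: six statements for six rows.
one_per_row = batched_insert_statements("events", rows, batch_size=1)
# Three rows per statement: two statements for the same six rows.
batched = batched_insert_statements("events", rows, batch_size=3)

print(len(one_per_row), "statements when inserting row by row")
print(len(batched), "statements when batching 3 rows each")
print(batched[0])
```

The sketch only formats integers, so it sidesteps quoting and escaping; string values would need proper escaping before being placed into a statement.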


Turn Off Pretty Printing (-B)


Impala Benchmarks

• TPC Benchmark™ DS (TPC-DS)

– The New Decision Support Benchmark Standard

• Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data population, queries, data maintenance model and implementation rules have been designed to be broadly representative of modern decision support systems.

https://github.com/cloudera/impala-tpcds-kit


Procedure of TPC-DS Benchmark (Impala)

Preparation

• tpcds-env.sh

• hdfs-mkdirs.sh

Data Generation

• gen-dims.sh

• gen-facts.sh

Data Loading

• impala-create-external-tables.sh

• impala-load-dims.sh

• impala-load-store_sales.sh


Store Sales ER-Diagram

[Figure: ER diagram of the store sales schema; store_sales is the fact table]

http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf

Query 7 – Intro.

• Compute the average quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event.

– Restrict the results to a specific gender, marital and educational status.


select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_cdemo_sk = cd_demo_sk
  and ss_promo_sk = p_promo_sk
  and cd_gender = 'F'
  and cd_marital_status = 'W'
  and cd_education_status = 'Primary'
  and (p_channel_email = 'N' or p_channel_event = 'N')
  and d_year = 1998
  and ss_sold_date_sk between 2450815 and 2451179
group by i_item_id
order by i_item_id
limit 100;

http://www.minddevelopmentanddesign.com/blog/leaving-las-vagues-or-focus-your-seo-keywords/


Conclusion

• Consider the Parquet table format

• Compression codec tradeoffs

• Disk I/O reduction by table partitioning

• See Query profiles for more information

• Run Impala Benchmarks and enjoy yourself

– TPC-DS (Decision Support Benchmark)

318, Rueiguang Rd., Taipei 114, Taiwan

Simon Hsu – Sr. Software [email protected]

Thank you
