moai-kids
Programming Hive Reading #3
@just_do_neet
Chapter 10. Tuning
•Using Explain / Explain Extended
•Optimized Join
•Local Mode
•Parallel Execution
•Strict Mode
•Tuning the Number of Map/Reduce
•JVM etc...
Using EXPLAIN
•Only grade-schoolers get a pass on skipping EXPLAIN (half joking)
•Output contents
•Abstract Syntax Tree(AST)
•Dependencies
•Stage Plans
•org.apache.hadoop.hive.ql.exec.ExplainTask
http://grepcode.com/file/repository.cloudera.com$content$repositories$releases@org.apache.hadoop.hive$hive-[email protected]@org$apache$hadoop$hive$ql$exec$ExplainTask.java
•AST (abstract syntax tree)
•TOK_FROM: input source (TOK_TABREF = table)
•TOK_INSERT: output destination
•TOK_SELECT: contents of the SELECT clause
•Dependencies
•MapReduce Job / Sampling Stage / Merge Stage / Limit Stage / etc..
•Stage Plans
•Stage Plans:Operators
•Using "EXPLAIN EXTENDED" prints more detailed information (e.g. where temporary files are written).
http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/ql/exec/Operator.html
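As a rough illustration of the two forms, the statements below could be run from the Hive CLI. The table hoge and its column id are placeholders borrowed from the other examples in these slides, not a real schema, and the exact sections printed vary by Hive version.

```sql
-- Print the query plan (stage dependencies and stage plans) without running the query.
EXPLAIN
SELECT id, count(*) FROM hoge GROUP BY id;

-- Same query, with additional detail such as temporary file locations.
EXPLAIN EXTENDED
SELECT id, count(*) FROM hoge GROUP BY id;
```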
Optimized Join
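•Order the tables in the join according to their row counts, e.g. when stocks > dividends.
•The rightmost table in the query is streamed (at reduce); all others are buffered. The largest table should therefore come last.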
•The stream table can be named explicitly with the hint "STREAMTABLE(tbl_name)".
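A sketch of both approaches, using the stocks and dividends tables mentioned above. The column names (ymd, symbol, price_close, dividend) are assumed for illustration and are not given in the slides.

```sql
-- Option 1: list the larger table (stocks) last so that it is the streamed one.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM dividends d JOIN stocks s
  ON s.ymd = d.ymd AND s.symbol = d.symbol;

-- Option 2: keep any table order and name the streamed table with a hint.
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol;
```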
•Verification: a: 1,000,000,000 records, b: 100,000,000 records
$ SELECT a.hoge, b.fuga FROM a JOIN b ON (a.id = b.id);
121.384 s
$ SELECT a.hoge, b.fuga FROM b JOIN a ON (b.id = a.id);
122.339 s
$ SELECT /*+ STREAMTABLE(a) */ a.hoge, b.fuga FROM b JOIN a ON (b.id = a.id);
120.298 s
Map Side Join
•Recap (previously covered material)
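The recap slides carry no text, so here is a minimal sketch of how a map-side join is commonly requested in Hive. It assumes stocks and dividends tables with illustrative columns (ymd, symbol, price_close, dividend), and that dividends is small enough to fit in memory; none of this comes from the slides.

```sql
-- Ask Hive to load the small table (d) into memory on each mapper,
-- so the join is performed in the map phase.
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol;

-- Alternatively, let Hive convert joins automatically when one side is small.
SET hive.auto.convert.join=true;
```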
Local Mode
•When the data is small, Local Mode can be faster because it reduces overhead.
$ set mapred.job.tracker=local;
$ set mapred.tmp.dir=/tmp/masashi/sada;
$ SELECT * FROM hoge WHERE id = 'fuga';
..........
Job running in-process (local Hadoop)
..........
•e.g. a table of about 30,000 records: normal mode 27 s, local mode 10 s
•e.g. a table of about 100,000,000 records: normal mode 40 s, local mode 532 s
•To have Hive switch to Local Mode automatically, set "hive.exec.mode.local.auto=true".
•Local Mode is used when all of the following hold:
• The total input size of the job is lower than "hive.exec.mode.local.auto.inputbytes.max" (128MB by default)
• The total number of map-tasks is less than "hive.exec.mode.local.auto.tasks.max" (4 by default)
• The total number of reduce tasks required is 1 or 0.
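The settings above could be applied in a session roughly as follows. The values simply restate the defaults quoted in the list; the property names are the ones given there and may differ in later Hive releases.

```sql
-- Let Hive decide per query whether to run in local mode.
SET hive.exec.mode.local.auto=true;
-- Upper bound on total input size in bytes (134217728 = 128 MB).
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Upper bound on the number of map tasks.
SET hive.exec.mode.local.auto.tasks.max=4;
```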
Strict Mode
•Tuning?
•When enabled, queries are checked more strictly: "hive.mapred.mode=strict"
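A sketch of what strict mode rejects, following the description in Programming Hive for Hive versions of that era: queries on partitioned tables without a partition filter, ORDER BY without LIMIT, and Cartesian products. The table logs, its partition column dt, and its column msg are hypothetical.

```sql
SET hive.mapred.mode=strict;

-- Rejected: partitioned table queried without a partition filter.
SELECT * FROM logs WHERE msg LIKE '%error%';
-- Accepted: the WHERE clause restricts the partition column dt.
SELECT * FROM logs WHERE dt = '2012-01-01' AND msg LIKE '%error%';

-- Rejected: ORDER BY without LIMIT.
SELECT * FROM logs WHERE dt = '2012-01-01' ORDER BY msg;
-- Accepted: ORDER BY combined with LIMIT.
SELECT * FROM logs WHERE dt = '2012-01-01' ORDER BY msg LIMIT 100;
```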
Tuning M/R Number
•hive.exec.reducers.bytes.per.reducer = <number>
•hive.exec.reducers.max = <number>
•mapred.reduce.tasks = <number>
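To make the three properties concrete: Hive estimates the reducer count as roughly the input size divided by the bytes-per-reducer value, capped at the maximum, unless a fixed count is set. The numbers below are illustrative only. With 256 MB per reducer, a 10 GB input would come out at around 40 reducers.

```sql
-- Target amount of input per reducer, in bytes (here 256 MB).
SET hive.exec.reducers.bytes.per.reducer=268435456;
-- Cap on the number of reducers Hive will request.
SET hive.exec.reducers.max=64;
-- Or bypass the estimate and fix the reducer count for the session.
SET mapred.reduce.tasks=16;
```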
JVM Reuse
•The number of Map/Reduce tasks that run in a single JVM can be configured (in "mapred-site.xml").
•-1 means unlimited.
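The slide does not name the property. On classic MapReduce (MR1) it is mapred.job.reuse.jvm.num.tasks, which is assumed here. The slide sets it in mapred-site.xml, while the sketch below uses a session-level SET for brevity.

```sql
-- Allow each JVM to run up to 10 tasks of the same job before exiting.
SET mapred.job.reuse.jvm.num.tasks=10;
-- -1 removes the limit.
SET mapred.job.reuse.jvm.num.tasks=-1;
```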
Dynamic Partition Tuning
•Limits on the use of Dynamic Partitions can be configured.
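The slide does not list the properties. The sketch below uses the dynamic partition settings described in Programming Hive, with illustrative values to be tuned per cluster.

```sql
-- Enable dynamic partitioning.
SET hive.exec.dynamic.partition=true;
-- strict: at least one partition column must be given a static value.
SET hive.exec.dynamic.partition.mode=strict;
-- Maximum number of dynamic partitions one statement may create in total.
SET hive.exec.max.dynamic.partitions=1000;
-- Maximum number of dynamic partitions each mapper or reducer may create.
SET hive.exec.max.dynamic.partitions.pernode=100;
-- Maximum number of files a job may create.
SET hive.exec.max.created.files=100000;
```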
Single MR Multi Group By
•Reference: https://issues.apache.org/jira/browse/HIVE-2056
•For a query like the following, "hive.multigroupby.singlemr=true" is reportedly faster.
FROM T
INSERT OVERWRITE TABLE test1 SELECT col1, count(DISTINCT colx) GROUP BY col1
INSERT OVERWRITE TABLE test2 SELECT col1, col2, count(DISTINCT colx) GROUP BY col1, col2;
Virtual Columns
•Tuning?
•The following information can be retrieved, and used in conditions, from HiveQL:
•INPUT__FILE__NAME
•BLOCK__OFFSET__INSIDE__FILE
•ROW__OFFSET__INSIDE__BLOCK (requires "hive.exec.rowoffset=true")
•Example
https://cwiki.apache.org/Hive/languagemanual-virtualcolumns.html
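A minimal sketch of how the virtual columns might be used. The table hoge, its column id, and the value 'fuga' are placeholders reused from the Local Mode example.

```sql
-- Show which file and byte offset each matching row came from.
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, id
FROM hoge
WHERE id = 'fuga';

-- Virtual columns can also appear in conditions.
SELECT count(*) FROM hoge WHERE BLOCK__OFFSET__INSIDE__FILE < 1000;

-- ROW__OFFSET__INSIDE__BLOCK is only available after enabling it.
SET hive.exec.rowoffset=true;
SELECT INPUT__FILE__NAME, ROW__OFFSET__INSIDE__BLOCK FROM hoge LIMIT 10;
```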
Conclusion
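•In real performance tuning, I think improving the data structure has a bigger effect than anything described above.
•Very much looking forward to the presenters covering Chapter 11 and Chapter 15!!!
Thank you for your attention.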