Hacking Cloudera Impala to run on a PostgreSQL cluster in MPP style. Performance is verified under typical SQL statements and under concurrency.
Story coming from…
• Data gravity
• Why big data
• Why SQL on big data
Today's agenda

• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
What happened in Miaozhen

• 3 billion ad impressions per day
• 20 TB of data scanned for report generation every morning
• A 24-server cluster
• Besides this:
  – TV Monitor
  – Mobile Monitor
  – Site Monitor
  – …
Before Hadoop
• Scrat
  – A PostgreSQL 9.1 cluster
  – Wrote a simple proxy
  – < 2 s for a 2 TB data scan
• Mobile Monitor
  – A Hadoop-like distributed computing system
  – RabbitMQ + 3 computing servers
  – Wrote a Map-Reduce in C++
  – Handles 30 million to 500 million ad impressions
Problem & Chance
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
  – Most data is relational
  – SQL interface
SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal
[Stack diagram: HDFS at the bottom; MapReduce above it; Hive / Pig on top of MapReduce; Impala / Drill / Pivotal / Presto alongside, querying HDFS directly.]
Latency matters
What's this?

• A kind of MPP engine
• In-memory processing
• Small-to-big join
  – Broadcast join
• Small result sizes
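The broadcast join above can be sketched briefly: the small table is replicated to every node, and each node probes it against its local partition of the big table, so the big table never moves. A minimal sketch in Python; the table contents and partitioning below are invented for illustration.

```python
# Minimal broadcast-join sketch: the small table is shipped ("broadcast")
# to every node, which joins it against its local partition of the big
# table. Data below is invented for illustration.

def broadcast_join(big_partitions, small_table, key):
    # Build a hash table on the small side once; conceptually this is
    # what gets broadcast to every node.
    small_by_key = {}
    for row in small_table:
        small_by_key.setdefault(row[key], []).append(row)

    joined = []
    for partition in big_partitions:      # one partition per node
        for big_row in partition:         # local probe; no shuffle of the big side
            for small_row in small_by_key.get(big_row[key], []):
                joined.append({**big_row, **small_row})
    return joined

# Big table: impressions split across 2 nodes; small table: campaign names.
impressions = [
    [{"id": 1, "campaign": "A"}, {"id": 2, "campaign": "B"}],
    [{"id": 3, "campaign": "A"}],
]
campaigns = [{"campaign": "A", "name": "spring"}, {"campaign": "B", "name": "summer"}]

result = broadcast_join(impressions, campaigns, "campaign")
```

This only pays off when the broadcast side is small, which is exactly the "small-to-big join" case the slide names.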
Why Cloudera Impala

• The team moves fast
  – UDFs coming out
  – Better join strategies on the way
• A good code base
  – Modular
  – Easy to add subclasses
• Really fast
  – LLVM code generation
    • 80 s / 95 s – uv test
  – Distributed aggregation tree
  – In-situ data processing (inside storage)
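The distributed aggregation tree works as follows: each leaf node pre-aggregates the rows it stores (in-situ), intermediate nodes merge partial results, and the root finalizes. A small sketch, with invented data and a hypothetical two-level tree shape:

```python
# Sketch of a distributed aggregation tree: leaves pre-aggregate their
# local rows, inner nodes merge partial counters, the root holds the
# final per-group counts. Data and tree shape are invented.
from collections import Counter

def leaf_aggregate(rows, key):
    # Partial aggregation right next to the data (in-situ processing).
    return Counter(row[key] for row in rows)

def merge(partials):
    # Inner/root nodes simply add partial counters together.
    total = Counter()
    for p in partials:
        total += p
    return total

# Three leaf nodes, each scanning its own data.
node_rows = [
    [{"campaign": "A"}, {"campaign": "B"}],
    [{"campaign": "A"}],
    [{"campaign": "B"}, {"campaign": "B"}],
]
leaves = [leaf_aggregate(rows, "campaign") for rows in node_rows]
# Two-level tree: merge leaves 0 and 1 first, then merge with leaf 2 at the root.
root = merge([merge(leaves[:2]), leaves[2]])
```

The point of the tree is that only small partial aggregates travel over the network, never the raw rows.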
Typical Arch.

[Architecture diagram: an SQL interface and a meta store on top; each of the three nodes below runs its own query planner, coordinator, and exec engine.]
Our target
• An MPP database
  – Built on PostgreSQL 9.1
  – Scales well
  – Speed
• A mixed-data-source MPP query engine
  – Join two tables from different sources
  – In fact…
Hacking… from where
• Add, not change
  – Scan node type
  – DB meta info
• Put changes in configuration
  – Thrift protocol update
    • TDBHostInfo
    • TDBScanNode
Front end
• Meta store update
  – Link data to the table name
  – Table location management
• Front end
  – Compute table locations
Back end
• Coordinator
  – PG host
• New scan node type
  – DB scan node
    • PG scan node
    • psql library using a cursor
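The PG scan node reads through a database cursor so it can stream rows in batches instead of materializing the whole result set. A sketch of that cursor-batching pattern; the real hack talks to PostgreSQL through the psql client library, but sqlite3 (Python stdlib) is substituted here so the sketch is self-contained and runnable.

```python
# Cursor-batching pattern behind a DB scan node: execute once, then pull
# fixed-size row batches until the cursor is exhausted. sqlite3 stands in
# for the PostgreSQL client library purely for illustration.
import sqlite3

def db_scan(conn, sql, batch_size=2):
    cur = conn.cursor()
    cur.execute(sql)
    while True:
        batch = cur.fetchmany(batch_size)  # one batch per fetch, not the whole table
        if not batch:
            break
        yield from batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, campaign TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "A"), (2, "B"), (3, "A"), (4, "B"), (5, "A")])
rows = list(db_scan(conn, "SELECT id, campaign FROM t ORDER BY id"))
```

Streaming in batches keeps the scan node's memory bounded even when the underlying table is large.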
SQL Plan

• select count(distinct id) from table
  – An MR-like process; the plan, bottom-up:
    1. HDFS/PG scan
    2. Aggr.: group by id (local pre-aggregation)
    3. Exchange node (hash-partition on id)
    4. Aggr.: group by id (merge)
    5. Aggr.: count(id)
    6. Exchange node (to the coordinator)
    7. Aggr.: sum(count(id))
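The plan above can be traced in miniature: each node deduplicates locally, ids are hash-exchanged so every distinct id lands on exactly one node, each node counts its share, and the coordinator sums the partial counts. A sketch with invented data:

```python
# Two-phase count(distinct id) sketch mirroring the MR-like plan:
# local group-by (dedup), hash exchange on id, per-node count(id),
# then sum(count(id)) at the coordinator. Data is invented.

def count_distinct(node_rows, num_nodes=2):
    # Phase 1: local "group by id" - deduplicate on each scan node.
    local_sets = [set(rows) for rows in node_rows]
    # Phase 2: exchange - hash-partition ids so each distinct id
    # ends up on exactly one node.
    buckets = [set() for _ in range(num_nodes)]
    for ids in local_sets:
        for i in ids:
            buckets[hash(i) % num_nodes].add(i)
    # Phase 3: per-node count(id); Phase 4: coordinator sums them.
    return sum(len(b) for b in buckets)

# Three nodes, with id values overlapping across nodes.
nodes = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
total = count_distinct(nodes)
```

The hash exchange is what makes the per-node counts safe to sum: no id is counted on two nodes.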
Env.
• Ad impression logs
  – 150 million rows, 100 KB/line
• 3 servers
  – 24 cores
  – 32 GB memory
  – 12 × 2 TB HDDs
  – 100 Mbps LAN
• Queries
  – select count(id) from t group by campaign
  – select count(distinct id) from t group by campaign
  – select * from t where id = 'xxxxxxxx'
Performance
[Bar chart: latency for the three queries (x-axis 1–3), y-axis 0–700, comparing impala, hive, and pg+impala.]
• Group-by speed per core: 20 M/s
With index
Codegen on/off
[Bar chart: runtimes for the uv_test, distinct, and duplicated queries, y-axis 0–100, with codegen enabled (en_codegen) vs. disabled (dis_codegen).]
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
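The third query selects ids that have rows with both c = '1' and c = '2': count(case … else null end) counts only the matching branch, so both counts must be positive. It can be run as-is against sqlite3 (Python stdlib) with a few invented rows to show the effect:

```python
# Demonstrate the conditional-count HAVING trick: keep only ids that
# appear with both c = '1' and c = '2'. Rows are invented; id 1 and id 4
# have both values, ids 2 and 3 have only one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, c TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "1"), (1, "2"), (2, "1"), (3, "2"), (4, "1"), (4, "2")])
ids = [row[0] for row in conn.execute("""
    SELECT id FROM t
    GROUP BY id
    HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
       AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
    LIMIT 10
""")]
```

count() skips NULLs, which is why the ELSE NULL branch makes each count conditional.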
Multi-user
Conclusion
• Source quality
  – Readable
  – Google C++ style
  – Robust
• An MPP solution based on PG
  – Proven performance
  – Easy to scale
• Mixed-engine usage
  – HDFS and DB
What's next

• YARN integration
• UDFs
• Joins with big tables
• BI roadmap
• Failover
Ref.

• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh
Thanks!
Q & A