Hacking Cloudera Impala to run on a PostgreSQL cluster in MPP style. Performance is verified under typical SQL statements and under concurrency.
Story coming from…
• Data gravity
• Why big data
• Why SQL on big data
Today's agenda

• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
What happened in Miaozhen

• 3 billion ad impressions per day
• 20 TB of data scanned for report generation every morning
• A 24-server cluster
• Besides this:
  – TV Monitor
  – Mobile Monitor
  – Site Monitor
  – …
Before Hadoop
• Scrat
  – A PostgreSQL 9.1 cluster
  – Wrote a simple proxy
  – < 2 s for a 2 TB data scan
• Mobile Monitor
  – A Hadoop-like distributed computing system
  – RabbitMQ + 3 computing servers
  – Wrote a Map-Reduce in C++
  – Handles 30 million to 500 million ad impressions
Problem & Chance
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
  – Most data is relational
  – SQL interface
SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal
[Stack diagram: HDFS at the bottom; MapReduce above it; Hive / Pig on top of MapReduce; Impala / Drill / Pivotal / Presto alongside, querying HDFS directly.]
Latency matters
What's this?

• A kind of MPP engine
• In-memory processing
• Small-to-big join
  – Broadcast join
• Small result sizes
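The broadcast join above can be sketched briefly: the small table is replicated to every node, and each node probes it against its local partition of the big table, so the big table never moves. A minimal sketch in Python; the table contents and partitioning below are invented for illustration.

```python
# Minimal broadcast-join sketch: the small table is shipped ("broadcast")
# to every node, which joins it against its local partition of the big
# table. Data below is invented for illustration.

def broadcast_join(big_partitions, small_table, key):
    # Build a hash table on the small side once; conceptually this is
    # what gets broadcast to every node.
    small_by_key = {}
    for row in small_table:
        small_by_key.setdefault(row[key], []).append(row)

    joined = []
    for partition in big_partitions:      # one partition per node
        for big_row in partition:         # local probe; no shuffle of the big side
            for small_row in small_by_key.get(big_row[key], []):
                joined.append({**big_row, **small_row})
    return joined

# Big table: impressions split across 2 nodes; small table: campaign names.
impressions = [
    [{"id": 1, "campaign": "A"}, {"id": 2, "campaign": "B"}],
    [{"id": 3, "campaign": "A"}],
]
campaigns = [{"campaign": "A", "name": "spring"}, {"campaign": "B", "name": "summer"}]

result = broadcast_join(impressions, campaigns, "campaign")
```

This only pays off when the broadcast side is small, which is exactly the "small-to-big join" case the slide names.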
Why Cloudera Impala

• The team moves fast
  – UDFs coming out
  – Better join strategies on the way
• A good code base
  – Modular
  – Easy to add subclasses
• Really fast
  – LLVM code generation
    • 80 s / 95 s – uv test
  – Distributed aggregation tree
  – In-situ data processing (inside storage)
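The distributed aggregation tree works as follows: each leaf node pre-aggregates the rows it stores (in-situ), intermediate nodes merge partial results, and the root finalizes. A small sketch, with invented data and a hypothetical two-level tree shape:

```python
# Sketch of a distributed aggregation tree: leaves pre-aggregate their
# local rows, inner nodes merge partial counters, the root holds the
# final per-group counts. Data and tree shape are invented.
from collections import Counter

def leaf_aggregate(rows, key):
    # Partial aggregation right next to the data (in-situ processing).
    return Counter(row[key] for row in rows)

def merge(partials):
    # Inner/root nodes simply add partial counters together.
    total = Counter()
    for p in partials:
        total += p
    return total

# Three leaf nodes, each scanning its own data.
node_rows = [
    [{"campaign": "A"}, {"campaign": "B"}],
    [{"campaign": "A"}],
    [{"campaign": "B"}, {"campaign": "B"}],
]
leaves = [leaf_aggregate(rows, "campaign") for rows in node_rows]
# Two-level tree: merge leaves 0 and 1 first, then merge with leaf 2 at the root.
root = merge([merge(leaves[:2]), leaves[2]])
```

The point of the tree is that only small partial aggregates travel over the network, never the raw rows.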
Typical Arch.

[Architecture diagram: an SQL interface and a meta store on top; each of the three nodes below runs its own query planner, coordinator, and exec engine.]
Our target
• An MPP database
  – Built on PostgreSQL 9.1
  – Scales well
  – Speed
• A mixed-data-source MPP query engine
  – Join two tables from different sources
  – In fact…
Hacking… from where
• Add, not change
  – Scan node type
  – DB meta info
• Put changes in configuration
  – Thrift protocol update
    • TDBHostInfo
    • TDBScanNode
Front end
• Meta store update
  – Link data to the table name
  – Table location management
• Front end
  – Compute table locations
Back end
• Coordinator
  – PG host
• New scan node type
  – DB scan node
    • PG scan node
    • psql library using a cursor
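The PG scan node reads through a database cursor so it can stream rows in batches instead of materializing the whole result set. A sketch of that cursor-batching pattern; the real hack talks to PostgreSQL through the psql client library, but sqlite3 (Python stdlib) is substituted here so the sketch is self-contained and runnable.

```python
# Cursor-batching pattern behind a DB scan node: execute once, then pull
# fixed-size row batches until the cursor is exhausted. sqlite3 stands in
# for the PostgreSQL client library purely for illustration.
import sqlite3

def db_scan(conn, sql, batch_size=2):
    cur = conn.cursor()
    cur.execute(sql)
    while True:
        batch = cur.fetchmany(batch_size)  # one batch per fetch, not the whole table
        if not batch:
            break
        yield from batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, campaign TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "A"), (2, "B"), (3, "A"), (4, "B"), (5, "A")])
rows = list(db_scan(conn, "SELECT id, campaign FROM t ORDER BY id"))
```

Streaming in batches keeps the scan node's memory bounded even when the underlying table is large.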
SQL Plan

• select count(distinct id) from table
  – An MR-like process; the plan, bottom-up:
    1. HDFS/PG scan
    2. Aggr.: group by id (local pre-aggregation)
    3. Exchange node (hash-partition on id)
    4. Aggr.: group by id (merge)
    5. Aggr.: count(id)
    6. Exchange node (to the coordinator)
    7. Aggr.: sum(count(id))
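The plan above can be traced in miniature: each node deduplicates locally, ids are hash-exchanged so every distinct id lands on exactly one node, each node counts its share, and the coordinator sums the partial counts. A sketch with invented data:

```python
# Two-phase count(distinct id) sketch mirroring the MR-like plan:
# local group-by (dedup), hash exchange on id, per-node count(id),
# then sum(count(id)) at the coordinator. Data is invented.

def count_distinct(node_rows, num_nodes=2):
    # Phase 1: local "group by id" - deduplicate on each scan node.
    local_sets = [set(rows) for rows in node_rows]
    # Phase 2: exchange - hash-partition ids so each distinct id
    # ends up on exactly one node.
    buckets = [set() for _ in range(num_nodes)]
    for ids in local_sets:
        for i in ids:
            buckets[hash(i) % num_nodes].add(i)
    # Phase 3: per-node count(id); Phase 4: coordinator sums them.
    return sum(len(b) for b in buckets)

# Three nodes, with id values overlapping across nodes.
nodes = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
total = count_distinct(nodes)
```

The hash exchange is what makes the per-node counts safe to sum: no id is counted on two nodes.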
Env.
• Ad impression logs
  – 150 million rows, 100 KB/line
• 3 servers
  – 24 cores
  – 32 GB memory
  – 12 × 2 TB HDDs
  – 100 Mbps LAN
• Queries
  – select count(id) from t group by campaign
  – select count(distinct id) from t group by campaign
  – select * from t where id = 'xxxxxxxx'
Performance
[Bar chart: latency for the three queries (x-axis 1–3), y-axis 0–700, comparing impala, hive, and pg+impala.]
• Group-by speed per core: 20 M/s
With index
Codegen on/off
[Bar chart: runtimes for the uv_test, distinct, and duplicated queries, y-axis 0–100, with codegen enabled (en_codegen) vs. disabled (dis_codegen).]
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
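The third query selects ids that have rows with both c = '1' and c = '2': count(case … else null end) counts only the matching branch, so both counts must be positive. It can be run as-is against sqlite3 (Python stdlib) with a few invented rows to show the effect:

```python
# Demonstrate the conditional-count HAVING trick: keep only ids that
# appear with both c = '1' and c = '2'. Rows are invented; id 1 and id 4
# have both values, ids 2 and 3 have only one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, c TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "1"), (1, "2"), (2, "1"), (3, "2"), (4, "1"), (4, "2")])
ids = [row[0] for row in conn.execute("""
    SELECT id FROM t
    GROUP BY id
    HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
       AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
    LIMIT 10
""")]
```

count() skips NULLs, which is why the ELSE NULL branch makes each count conditional.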
Multi-user
Conclusion
• Source quality
  – Readable
  – Google C++ style
  – Robust
• An MPP solution based on PG
  – Proven performance
  – Easy to scale
• Mixed-engine usage
  – HDFS and DB
What's next

• YARN integration
• UDFs
• Joins with big tables
• BI roadmap
• Failover
Ref.

• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh
Thanks!
Q & A