DESCRIPTION
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax. This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Impala: Modern, Open-Source SQL Engine for Hadoop
Gwen Shapira
@[email protected]
Agenda
• Why Hadoop?
• Data Processing in Hadoop
• User’s view of Impala
• Impala Use Cases
• Impala Architecture
• Performance highlights
In the beginning….
was the database
For a while, the database was all we needed.
Data is not what it used to be
[Chart: data growth, 1980 to 2012. Structured data: 20%. Unstructured data: 80%.]
Hadoop was Invented to Solve:
• Large volumes of data
• Data that is only valuable in bulk
• High ingestion rates
• Data that requires more processing
• Differently structured data
• Evolving data
• High license costs
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
Core Hadoop system components
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework
Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
Processing Data in Hadoop
MapReduce
• Versatile
• Flexible
• Scalable

• High latency
• Batch oriented
• Java
• Challenging paradigm
Hive & Pig
• Hive – Turn SQL into MapReduce
• Pig – Turn execution plans into MapReduce
• Makes MapReduce easier
• But not any faster
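To make "turn SQL into MapReduce" concrete, here is a minimal single-process sketch of how an aggregate such as SELECT state, SUM(revenue) … GROUP BY state maps onto map, shuffle, and reduce phases. The function names and sample rows are illustrative assumptions, and this only shows the general shape of the translation, not how Hive actually generates its jobs.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (key, value) pair per input row: key = state, value = revenue."""
    for row in rows:
        yield row["state"], row["revenue"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_sum(rows):
    """Run the three phases in sequence in a single process."""
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical sample data, not taken from the talk.
SAMPLE_ROWS = [
    {"state": "CA", "revenue": 10},
    {"state": "NY", "revenue": 5},
    {"state": "CA", "revenue": 7},
]

if __name__ == "__main__":
    print(group_by_sum(SAMPLE_ROWS))
```

In a real cluster each phase runs as separate tasks with intermediate results written out between them, which is one source of the latency the previous slide mentions.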
Towards a Better Map Reduce
• Spark – Next generation MapReduce
  • With in-memory caching
  • Lazy evaluation
  • Fast recovery times from node failures
• Tez – Next generation MapReduce
  • Reduced overhead, more flexibility
  • Currently alpha
And now to something completely different!
What is Impala?
Impala Overview
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with Hive SQL
Native MPP query engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open source
• Apache-licensed

Impala Overview
Runs directly within Hadoop
• Reads widely used Hadoop file formats
• Talks to widely used Hadoop storage managers
• Runs on same nodes that run Hadoop processes
High performance
• C++ instead of Java
• Runtime code generation
• Completely new execution engine: no MapReduce

Impala is Production Ready
• Beta version released in October 2012
• General availability (v1.0) released in April 2013
• Latest release (v1.2.3) released on December 23rd
User View of Impala: Overview
• Distributed service in cluster: one Impala daemon on each node with data
• Highly available: no single point of failure
• Submit query to any daemon:
  • ODBC/JDBC
  • Impala CLI
  • Hue
• Query is distributed to all nodes with relevant data
• Impala uses Hive’s metadata
User View of Impala: File Formats
• There is no ‘Impala format’.
• Impala supports:
  • Uncompressed/LZO-compressed text files
  • Sequence files and RCFile with Snappy/gzip compression
  • Avro data files
  • Parquet columnar format (more on that later)
  • HBase
User View of Impala: SQL Support
• Most of SQL-92
• INSERT INTO … SELECT …
• Only equi-joins; no non-equi joins, no cross products
• ORDER BY requires LIMIT (for now)
• DDL support
• SQL-style authorization via Apache Sentry (incubating)
• UDFs and UDAFs are supported
Use Cases
Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML
• Data processing with tight SLAs
• Query-able archive with full fidelity
Global Financial Services Company
Saved 90% on incremental EDW spend & improved performance by 5x
Offload data warehouse for query-able archive
Store decades of data cost-effectively
Process & analyze on the same system
Improved capabilities through interactive query on more data
Digital Media Company
20x performance improvement for exploration & data discovery
Easily identify new data sets for modeling
Interact with raw data directly to test hypotheses
Avoid expensive DW schema changes
Accelerate ‘time to answer’
Impala Architecture
Impala Architecture
• Impala daemon (impalad) – N instances
  • Query execution
• State store daemon (statestored) – 1 instance
  • Provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
  • Relays metadata changes to all impalads
Impala Query Execution
[Diagram: a SQL app (ODBC) sends a SQL request to one of three impalad nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]
1) Request arrives via ODBC/JDBC/Hue/Shell
Impala Query Execution
[Diagram: a SQL app (ODBC) and three impalad nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
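Step 3 depends on knowing which nodes hold which data. The sketch below shows one simple way a coordinator could pick a local node for each data block. Everything here is an assumption for illustration: the block and host names are made up, and the "least-loaded replica holder" rule is a stand-in, not Impala's actual scheduling policy.

```python
def assign_fragments(block_locations):
    """Assign each block to one host that stores a replica of it.

    block_locations: {block_id: [hosts holding a replica]}
    Returns {host: [block_ids scanned locally on that host]}.

    Illustrative rule (an assumption): choose the replica holder with the
    fewest blocks assigned so far, breaking ties by host name.
    """
    assignments = {}
    for block_id in sorted(block_locations):
        hosts = block_locations[block_id]
        chosen = min(hosts, key=lambda h: (len(assignments.get(h, [])), h))
        assignments.setdefault(chosen, []).append(block_id)
    return assignments


# Hypothetical replica placement across three nodes.
BLOCKS = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node2", "node3"],
}

if __name__ == "__main__":
    for host, blocks in sorted(assign_fragments(BLOCKS).items()):
        print(host, blocks)
```

Every block is read by a node that already stores it, so scans avoid a network hop; only intermediate and final results move between nodes.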
Impala Query Execution
[Diagram: a SQL app (ODBC) and three impalad nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore. Query results flow back to the SQL app.]
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
Query Planner
• 2-phase planning
  • Left-deep tree
  • Partition plan to maximize data locality
• Join order
  • Before 1.2.3: order of tables in query
  • 1.2.3 and above: cost-based if statistics exist
• Plan operators
  • Scan, HashJoin, HashAggregation, Union, TopN, Exchange
  • All operators are fully distributed
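As a rough illustration of what "left-deep tree" and statistics-driven join order mean, the sketch below builds a left-deep plan from estimated row counts. The ordering rule (largest table as the leftmost leaf, remaining tables joined smallest first) and the tuple-based plan representation are assumptions made for this example; a real cost-based planner estimates join cardinalities rather than sorting by table size alone.

```python
def left_deep_plan(table_rows):
    """Build a left-deep join tree from {table_name: estimated_row_count}.

    Illustrative rule (an assumption): the largest table is the leftmost
    leaf, and the remaining tables are joined in ascending size order so
    that each hash table is built from a comparatively small input.
    """
    by_size_desc = sorted(table_rows, key=table_rows.get, reverse=True)
    largest, rest = by_size_desc[0], by_size_desc[1:]
    plan = ("Scan", largest)
    for name in reversed(rest):
        # The right child is always a base-table scan, never another join.
        plan = ("HashJoin", plan, ("Scan", name))
    return plan


def is_left_deep(plan):
    """True if every join's right input is a plain scan."""
    if plan[0] == "Scan":
        return True
    _, left, right = plan
    return right[0] == "Scan" and is_left_deep(left)


# Hypothetical statistics for four tables.
STATS = {"t0": 1_000_000, "t1": 500, "t2": 20_000, "t3": 50}

if __name__ == "__main__":
    print(left_deep_plan(STATS))
```

Without statistics there is nothing to sort on, which is why the pre-1.2.3 behavior falls back to the order in which tables appear in the query.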
Query Execution Example
Simple Example
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (h.id = b.id)
GROUP BY state
ORDER BY 2 DESC
LIMIT 10
How does a database execute a query?
• Left-deep tree
• Data flows from bottom to top
[Plan tree, top to bottom: TopN, Agg, HashJoin. HashJoin inputs: HdfsScan (left), HbaseScan (right).]
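Wait – Why is this a left-deep tree?
[Diagram: an Agg node above a chain of three HashJoin nodes. The left input of each HashJoin is the join below it, with Scan: t0 as the leftmost leaf; the right inputs are the scans of t1, t2, and t3.]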
How does a database execute a query?
• Hash Join node fills the hash table with the RHS table data.
• So, the RHS table (HBase scan) is scanned first.
[Diagram annotation: scan HBase first]
How does a database execute a query?
• Start scanning the LHS (HDFS) table
• For each row from the LHS, probe the hash table for matching rows
[Diagram annotation: probe hash table and a matching row is found]
How does a database execute a query?
• Matched rows are bubbled up the execution tree
How does a database execute a query?
• Continue scanning the LHS (HDFS) table
• For each row from the LHS, probe the hash table for matching rows
• Unmatched rows are discarded
[Diagram annotations across the animation frames: no matching row; matching row found and bubbled up the execution tree; no matching row]
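The build and probe phases walked through above can be sketched in a few lines. This is a single-process illustration under stated assumptions: rows are plain dictionaries, the sample data is invented, and the join is an inner equi-join on one column. It is not Impala's C++ implementation.

```python
from collections import defaultdict


def hash_join(lhs_rows, rhs_rows, key):
    """Inner equi-join: build a hash table on the RHS, then probe with the LHS."""
    # Build phase: the RHS is scanned completely before any LHS row is read.
    table = defaultdict(list)
    for rhs_row in rhs_rows:
        table[rhs_row[key]].append(rhs_row)

    # Probe phase: stream the LHS; matches are emitted, non-matches are dropped.
    for lhs_row in lhs_rows:
        for rhs_row in table.get(lhs_row[key], ()):
            yield {**rhs_row, **lhs_row}


# Hypothetical rows standing in for HdfsTbl (LHS) and HbaseTbl (RHS).
HDFS_ROWS = [
    {"id": 1, "revenue": 10},
    {"id": 2, "revenue": 5},
    {"id": 3, "revenue": 7},   # no match in HBASE_ROWS, so it is discarded
    {"id": 1, "revenue": 4},
]
HBASE_ROWS = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "NY"},
]

if __name__ == "__main__":
    for row in hash_join(HDFS_ROWS, HBASE_ROWS, "id"):
        print(row)
```

Only the RHS has to fit in the hash table, while the LHS is streamed row by row. That asymmetry is why the smaller table belongs on the right.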
How does a database execute a query?
• All rows have been returned from the hash join node. The Agg node can start returning rows
• Rows are bubbled up the execution tree
How does a database execute a query?
• Rows from the aggregation node bubble up to the top-n node
• When all rows are returned by the agg node, the top-n node can start returning rows to the end user
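The last two operators in the tree can be sketched the same way. Both are blocking: the aggregation cannot emit a group total until it has seen every joined row, and the top-n node cannot name a winner until it has seen every group. The function names and sample rows below are assumptions for illustration only.

```python
import heapq
from collections import defaultdict


def aggregate_sum(rows, group_key, value_key):
    """Blocking operator: consume all input, then return (group, total) pairs."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return list(totals.items())


def top_n(pairs, n):
    """Return the n pairs with the largest totals, largest first."""
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Hypothetical output of the hash join node.
JOINED_ROWS = [
    {"state": "CA", "revenue": 10},
    {"state": "NY", "revenue": 5},
    {"state": "CA", "revenue": 4},
    {"state": "TX", "revenue": 20},
]

if __name__ == "__main__":
    print(top_n(aggregate_sum(JOINED_ROWS, "state", "revenue"), 2))
```

A top-n with a small n only needs to keep about n candidates at a time instead of sorting the whole input, which suggests why pairing ORDER BY with LIMIT is the cheaper case to support.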
Key takeaways
• Data flows from bottom to top in the execution tree and finally goes to the end user
• Larger tables go on the left
• Collect statistics
• Filter early
Simpler Example
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (h.id = b.id)
GROUP BY state
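How does an MPP database execute a query?
[Diagram: the single-node plan (Agg over a HashJoin of Tbl a Scan and Tbl b Scan) next to its distributed form: Tbl b Scan feeds an Exch that broadcasts to the HashJoin with Tbl a Scan, followed by Agg, an Exch that re-distributes by “state”, and a final Agg.]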
How does an MPP database execute a query?
[Diagram: three nodes run the same pipeline in parallel. Each does a local read of Tbl A, receives the scanned and broadcast Tbl B, performs A join B and a Local Agg; rows are then re-distributed by “state” to a Final Agg on each node.]
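The distributed plan can be simulated in one process to show where data moves. This is a sketch under several assumptions: node-local partitions of Tbl A are modeled as Python lists, Tbl B has one row per id, the sample data is invented, and crc32 stands in for whatever partitioning hash a real engine uses.

```python
import zlib
from collections import defaultdict


def mpp_group_by_sum(a_partitions, b_rows):
    """Simulate broadcast join, local agg, re-distribution, and final agg."""
    num_nodes = len(a_partitions)

    # Broadcast: every node receives all of Tbl B and builds the same lookup.
    # Assumption: ids in Tbl B are unique.
    state_by_id = {row["id"]: row["state"] for row in b_rows}

    # Each node joins its local partition of Tbl A and pre-aggregates.
    local_aggs = []
    for partition in a_partitions:
        local = defaultdict(int)
        for row in partition:
            state = state_by_id.get(row["id"])
            if state is not None:  # inner join: unmatched rows are dropped
                local[state] += row["revenue"]
        local_aggs.append(local)

    # Re-distribute by "state": all partial sums for one state go to one node,
    # where the final aggregation merges them.
    final_aggs = [defaultdict(int) for _ in range(num_nodes)]
    for local in local_aggs:
        for state, partial in local.items():
            target = zlib.crc32(state.encode("utf-8")) % num_nodes
            final_aggs[target][state] += partial

    # Each state lives on exactly one node, so the union has no conflicts.
    result = {}
    for node_result in final_aggs:
        result.update(node_result)
    return result


# Hypothetical data: Tbl A split across three nodes, Tbl B small enough to broadcast.
A_PARTITIONS = [
    [{"id": 1, "revenue": 10}, {"id": 2, "revenue": 5}],
    [{"id": 1, "revenue": 4}, {"id": 3, "revenue": 7}],
    [{"id": 2, "revenue": 1}],
]
B_ROWS = [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}]

if __name__ == "__main__":
    print(mpp_group_by_sum(A_PARTITIONS, B_ROWS))
```

The large table never leaves its node. Only the small table and the already-reduced partial sums cross the network.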
Performance
Impala Performance Results
• Impala’s latest milestone:
  • Comparable commercial MPP DBMS speed
  • Natively on Hadoop
• Three result sets:
  • Impala vs Hive 0.12 (Impala 6-70x faster)
  • Impala vs “DBMS-Y” (Impala average of 2x faster)
  • Impala scalability (Impala achieves linear scale)
• Background
  • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
  • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
  • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
  • Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
Impala vs Hive 0.12 (Lower bars are better)
Impala vs “DBMS-Y” (Lower bars are better)
Impala Scalability: 2x the Hardware (Expectation: Cut Response Times in Half)
Impala Scalability: 2x the Hardware and 2x Users/Data (Expectation: Constant Response Times)
[Charts: 2x the users, 2x the hardware; 2x the data, 2x the hardware]