atscaleinc
The BI for Hadoop Benchmark
Q1 2016
atscale.com/benchmark
© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY
Hadoop Use Cases have evolved
[Chart: Hadoop use cases, Yesterday vs. Today, for ETL, Data Science, and Business Intelligence. Extracted values: 74%, 62%, 65%; 51%, 56%, 69%.]
atscale.com/survey
Self-Service leads to Business Value
[Chart: No Access vs. Self Service. Extracted values: 41%, 61%; 59%, 39%.]
Companies that provide self-service access to business units are 50% more likely to gain value out of Hadoop.
Most Don’t Have Self-Service on Hadoop
Close to 60% have not provided self-service access to Hadoop yet
[Chart: Yes 41%, No 59%]
Why Self-Service is so Hard
1. Current BI Tools are limited
2. Hadoop is not optimized for performance
3. Governance and security are an issue
4. Current approaches are unnatural
The BI for Hadoop Benchmark
Q1 2016
Benchmark Framework
Three key concepts need to be inspected when evaluating SQL-on-Hadoop engines and their fitness to satisfy Business Intelligence workloads:
• Performs on Big Data: the SQL-on-Hadoop engine must be able to consistently analyze billions or trillions of rows of data without generating errors, with response times on the order of tens or hundreds of seconds.
• Fast on Small Data: the engine needs to deliver interactive performance on known query patterns, so it must return results in no more than a few seconds on small data sets (on the order of thousands or millions of rows).
• Stable for Many Users: Enterprise BI user bases consist of hundreds or thousands of data workers, so the underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads.
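The "Stable for Many Users" criterion can be illustrated with a small concurrency harness. This is a minimal sketch, not the harness used for the benchmark: `run_query` is a hypothetical stand-in that only sleeps, and the function names, the thread-per-user model, and the reported statistics are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_query(query_id: str) -> str:
    """Hypothetical stand-in for submitting one query to a SQL-on-Hadoop engine."""
    time.sleep(0.01)  # simulated engine latency; swap in a real client call
    return query_id


def timed(query_id: str) -> float:
    """Return the wall-clock seconds taken by a single query."""
    start = time.perf_counter()
    run_query(query_id)
    return time.perf_counter() - start


def concurrency_test(query_ids, users: int) -> dict:
    """Run every query once per simulated user, with `users` threads in flight."""
    workload = [q for q in query_ids for _ in range(users)]
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = list(pool.map(timed, workload))
    return {
        "queries_run": len(latencies),
        "median_s": median(latencies),
        "max_s": max(latencies),
    }


if __name__ == "__main__":
    print(concurrency_test(["Q1.1", "Q2.1", "Q3.1"], users=5))
```

Raising `users` while watching the median and maximum latency gives a rough picture of how an engine degrades under concurrent load.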
Benchmark Queries
Data Set: Star Schema Benchmark (SSB) data set
6B rows, 13 queries, 3 patterns
1. “Quick Metric” queries: Compute a particular metric value for a period of time. These queries have a small number of joins and minimal or no group-bys (Q1.1 - Q1.3).
2. “Product Insight” queries: Compute a metric (or several metrics) aggregated against a set of product and date-based dimensions. These queries include “medium” sized joins and a small number of group-bys (Q2.1 - Q2.3).
3. “Customer Insight” queries: Compute a metric (or several metrics) aggregated against a set of product, customer, and date-based dimensions. These queries include both “medium” and “very large” sized joins as well as a number of group-bys (Q3.1 - Q4.3).
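To make the "Quick Metric" pattern concrete, here is a minimal sketch of a query shaped like Q1.1: one join to the date dimension, no group-bys, and three filters. The deck does not include the query text, so the SQL is modeled on the publicly documented Star Schema Benchmark rather than taken from the benchmark run; the table name `date_dim`, the reduced column set, and the four toy rows are assumptions, and SQLite stands in for a SQL-on-Hadoop engine.

```python
import sqlite3

# Toy in-memory star schema: one fact table and one date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE lineorder (
    lo_orderdate INTEGER, lo_quantity INTEGER,
    lo_discount INTEGER, lo_extendedprice INTEGER
);
INSERT INTO date_dim VALUES (19930101, 1993), (19940101, 1994);
INSERT INTO lineorder VALUES
    (19930101, 10, 2, 1000),  -- passes every filter
    (19930101, 30, 2, 5000),  -- fails lo_quantity < 25
    (19930101, 10, 5, 7000),  -- fails lo_discount BETWEEN 1 AND 3
    (19940101, 10, 2, 9000);  -- fails d_year = 1993
""")

# One join, no GROUP BY, one filter on the joined table, and one range
# plus one comparative filter directly on the fact table.
QUICK_METRIC = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
JOIN date_dim ON lo_orderdate = d_datekey
WHERE d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""

revenue = conn.execute(QUICK_METRIC).fetchone()[0]
print(revenue)  # 2000: only the first row survives, 1000 * 2
```

The "Product Insight" and "Customer Insight" patterns add joins to further dimension tables and a GROUP BY over the selected dimensions.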
Un-Aggregated Results
Benchmark Key Findings
• One engine does not fit all: Depending on raw data size, query complexity, and the target number of end users, enterprises will find that one engine can’t accomplish it all. Each engine has its own ‘sweet spot’, and enterprises may find that a blend of SQL-on-Hadoop engines fits their company’s goals better.
• Small vs. Big Data: While all query engines successfully completed the “Large Data” query tests, Spark SQL and Impala performed better on smaller data sets (tables with thousands or several million rows of data).
• Few vs. Many Users: Impala showed the best concurrency test results, ahead of Hive and Spark SQL. Companies that anticipate connecting large numbers of business users to Hadoop may want to consider Impala.
• Constant Innovation: Open source contribution, as seen in the Spark SQL improvements, provides constant innovation. We expect the industry to continue innovating here: for example, Cloudera donated the Impala project to the Apache Software Foundation this past November. There is no doubt more innovation will come out of this new development.
Environment Details
Benchmarks: Environment
For our test environment we used a 12-node cluster with:
• 1 master node
• 1 gateway node
• 10 data nodes

RAM per node                            128G
CPU specs for data (worker) nodes       32 CPU cores
Storage specs for data (worker) nodes   2x 512mb SSD
Benchmarks: Data Set
Table Name       Number of Rows
CUSTOMER_SMALL   30M
CUSTOMER         1B
LINEORDER        6B
SUPPLIER         2M
PART             2M
DATE             16K
Benchmarks: Queries
Query ID  Joins  Largest Join Table (rows)  Group Bys  Filters  Comments
Q1.1      1      16,799                     0          3        1 range condition, 1 comparative filter condition directly on LINEORDER table
Q1.2      1      16,799                     0          3        2 range filter conditions directly on LINEORDER table
Q1.3      1      16,799                     0          4        2 range filter conditions directly on LINEORDER table, 2 conditions on joined table
Q2.1      3      2,000,000                  2          2        filter on p_category (less selective)
Q2.2      3      2,000,000                  2          2        filter on p_brand, 2 values (more selective)
Q2.3      3      2,000,000                  2          2        filter on p_brand, 1 value (most selective)
Q3.1      3      1,050,000,000              3          3        filter on region (less selective)
Q3.2      3      1,050,000,000              3          3        filter on nation (more selective)
Q3.3      3      1,050,000,000              3          3        filter on city (most selective)
Q3.4      3      1,050,000,000              3          3        filter on city (most selective) and month (vs. year)
Q4.1      4      1,050,000,000              2          2
Q4.2      4      1,050,000,000              3          3        includes filter on year (more selective)
Q4.3      4      1,050,000,000              3          3        includes filter on year and nation (most selective)
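As a quick cross-check, the query IDs in this table can be grouped into the three patterns named earlier in the deck ("Quick Metric" for Q1.x, "Product Insight" for Q2.x, "Customer Insight" for Q3.1 through Q4.3). The sketch below is illustrative only; the names `PATTERN_BY_FLIGHT` and `pattern_for` are hypothetical helpers, not part of any benchmark tooling.

```python
from collections import Counter

# The 13 query IDs listed in the queries table.
QUERY_IDS = [
    "Q1.1", "Q1.2", "Q1.3",
    "Q2.1", "Q2.2", "Q2.3",
    "Q3.1", "Q3.2", "Q3.3", "Q3.4",
    "Q4.1", "Q4.2", "Q4.3",
]

# Mapping from query group prefix to the pattern names used in the deck.
PATTERN_BY_FLIGHT = {
    "Q1": "Quick Metric",
    "Q2": "Product Insight",
    "Q3": "Customer Insight",
    "Q4": "Customer Insight",
}


def pattern_for(query_id: str) -> str:
    """Return the pattern name for an ID such as 'Q3.2'."""
    flight = query_id.split(".")[0]
    return PATTERN_BY_FLIGHT[flight]


counts = Counter(pattern_for(q) for q in QUERY_IDS)
print(dict(counts))
# {'Quick Metric': 3, 'Product Insight': 3, 'Customer Insight': 7}
```

The counts add up to the 13 queries and 3 patterns stated for the benchmark.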
About AtScale
AtScale Intelligence Platform
I.T. needs Control & Consistency
The Business needs Freedom & Self-Service
The Business Interface for Hadoop
Superior Architecture
• Any BI tool
• Industry standards
• Schema on demand
• Write once