Business Intelligence on Hadoop Benchmark

The BI for Hadoop Benchmark

Q1 2016

atscale.com/benchmark

© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY

Hadoop Use Cases have evolved

Chart: Hadoop use cases (ETL, Data Science, Business Intelligence), Yesterday vs. Today. Values shown: 74%, 62%, 65%; 51%, 56%, 69%.

atscale.com/survey


Self-Service leads to Business Value

atscale.com/survey

Chart: No Access vs. Self Service. Values shown: 41%, 61%; 59%, 39%.

Companies that provide self-service access to business units are 50% more likely to gain value out of Hadoop


Most Don’t Have Self-Service on Hadoop

atscale.com/survey

Close to 60% have not provided self-service access to Hadoop yet

Yes: 41%

No: 59%


Why Self-Service is so Hard

1. Current BI Tools are limited

2. Hadoop is not optimized for performance

3. Governance and security are an issue

4. Current approaches are unnatural





Benchmark Framework

Three key criteria need to be examined when evaluating SQL-on-Hadoop engines and their fitness for Business Intelligence workloads:

• Performs on Big Data: the SQL-on-Hadoop engine must be able to consistently analyze billions or trillions of rows of data without generating errors and with response times on the order of tens or hundreds of seconds.

• Fast on Small Data: the engine needs to deliver interactive performance on known query patterns, so it is important that the SQL-on-Hadoop engine return results in no more than a few seconds on small data sets (on the order of thousands or millions of rows).

• Stable for Many Users: Enterprise BI user bases consist of hundreds or thousands of data workers, so the underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads.
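The "Stable for Many Users" criterion can be illustrated with a small concurrency harness. This is a minimal sketch, not the harness used for the benchmark: `run_query` is a hypothetical stand-in that only sleeps, and the user count and query IDs are illustrative. A real test would replace `run_query` with a call to the engine's client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_query(query_id: str) -> str:
    """Hypothetical stand-in for submitting a query to a SQL-on-Hadoop engine."""
    time.sleep(0.01)  # simulated engine work; replace with a real client call
    return query_id


def timed_run(query_id: str) -> tuple[str, float, bool]:
    """Run one query and return (query_id, elapsed_seconds, succeeded)."""
    start = time.perf_counter()
    try:
        run_query(query_id)
        ok = True
    except Exception:
        ok = False
    return query_id, time.perf_counter() - start, ok


def concurrency_test(query_ids: list[str], users: int) -> dict:
    """Simulate `users` concurrent analysts, each running every query once."""
    workload = [q for _ in range(users) for q in query_ids]
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(timed_run, workload))
    latencies = [elapsed for _, elapsed, _ in results]
    return {
        "runs": len(results),
        "errors": sum(1 for _, _, ok in results if not ok),
        "median_s": median(latencies),
        "max_s": max(latencies),
    }


summary = concurrency_test(["Q1.1", "Q2.1", "Q3.1"], users=5)
print(summary)
```

Tracking errors alongside latency reflects the framework above: an engine that is fast but fails under load does not meet the stability criterion.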



Benchmark Queries

Data Set: Star Schema Benchmark (SSB) data set

6B rows, 13 queries, 3 patterns

1. “Quick Metric” queries: Compute a particular metric value for a period of time. These queries have a small number of joins and minimal or no group-bys (Q1.1 - Q1.3)

2. “Product Insight” queries: Compute a metric (or several metrics) aggregated against a set of product and date based dimensions. These queries include “medium” sized joins and a small number of group-bys (Q2.1 - Q2.3)

3. “Customer Insight” queries: Compute a metric (or several metrics) aggregated against a set of product, customer, and date-based dimensions. These queries include both “medium” and “very large” sized joins as well as a number of group-bys (Q3.1 - Q4.3)
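To make the "Quick Metric" pattern concrete, here is a minimal sketch of a Q1.1-style query run against a tiny in-memory SQLite database. The query shape follows the publicly documented SSB Q1.1 (one join to the date dimension, no group-bys, three filters). The table name `dim_date`, the sample rows, and the use of SQLite are assumptions for illustration only and say nothing about how the benchmark engines executed it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE lineorder (
    lo_orderdate INTEGER, lo_quantity INTEGER,
    lo_discount INTEGER, lo_extendedprice INTEGER
);
INSERT INTO dim_date VALUES (19930101, 1993), (19940101, 1994);
INSERT INTO lineorder VALUES
    (19930101, 10, 2, 1000),  -- passes every filter
    (19930101, 30, 2, 1000),  -- quantity too high
    (19930101, 10, 5, 1000),  -- discount out of range
    (19940101, 10, 2, 1000);  -- wrong year
""")

# One join, no GROUP BY, three filters: one on the joined date table,
# plus a range and a comparison directly on lineorder.
QUICK_METRIC = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
JOIN dim_date ON lo_orderdate = d_datekey
WHERE d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""

revenue = conn.execute(QUICK_METRIC).fetchone()[0]
print(revenue)  # 2000: only the first row passes all three filters
```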



Un-Aggregated Results



Benchmark Key Findings

• One engine does not fit all: Depending on raw data size, query complexity, and the target number of end users, enterprises will find that one engine can’t accomplish it all. Each engine has its own ‘sweet spot’, and enterprises may find that a blended use of SQL-on-Hadoop engines fits their company’s goals better.

• Small vs. Big Data: While all query engines successfully completed the “Large Data” query tests, Spark SQL and Impala performed better on smaller data sets, meaning tables with thousands or several million rows of data.

• Few vs. Many Users: Impala showed the best concurrency test results, ahead of Hive and Spark SQL. Companies that anticipate connecting large numbers of business users to Hadoop may want to consider Impala.

• Constant Innovation: Open source contributions, as seen in the Spark SQL improvements, drive constant innovation. We expect the industry to continue innovating here: for example, Cloudera donated the Impala project to the Apache Software Foundation this past November. There is no doubt more innovation will come out of this development.


Environment Details



Benchmarks: Environment

For our test environment we used a 12-node cluster with:
• 1 master node
• 1 gateway node
• 10 data nodes

RAM per node: 128 GB

CPU specs for data (worker) nodes: 32 CPU cores

Storage specs for data (worker) nodes: 2x 512 GB SSD


Benchmarks: Data Set

Table Name        Number of Rows
CUSTOMER_SMALL    30M
CUSTOMER          1B
LINEORDER         6B
SUPPLIER          2M
PART              2M
DATE              16K


Benchmarks: Queries

Query ID  Joins  Largest Join Table (rows)  Group Bys  Filters  Comments
Q1.1      1      16,799                     0          3        1 range condition, 1 comparative filter condition directly on LINEORDER table
Q1.2      1      16,799                     0          3        2 range filter conditions directly on LINEORDER table
Q1.3      1      16,799                     0          4        2 range filter conditions directly on LINEORDER table, 2 conditions on joined table
Q2.1      3      2,000,000                  2          2        filter on p_category (less selective)
Q2.2      3      2,000,000                  2          2        filter on p_brand, 2 values (more selective)
Q2.3      3      2,000,000                  2          2        filter on p_brand, 1 value (most selective)
Q3.1      3      1,050,000,000              3          3        filter on region (less selective)
Q3.2      3      1,050,000,000              3          3        filter on nation (more selective)
Q3.3      3      1,050,000,000              3          3        filter on city (most selective)
Q3.4      3      1,050,000,000              3          3        filter on city (most selective) and month (vs. year)
Q4.1      4      1,050,000,000              2          2
Q4.2      4      1,050,000,000              3          3        includes filter on year (more selective)
Q4.3      4      1,050,000,000              3          3        includes filter on year and nation (most selective)
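As an illustration of how the join, group-by, and filter counts in the Q2.1 row translate into SQL, the sketch below runs a Q2.1-style query on a toy SQLite database. It is modeled on the publicly documented SSB Q2.1. The table name `dim_date`, the `s_region` filter, the literal values, and the sample rows are assumptions for illustration and are not taken from the benchmark itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE part (p_partkey INTEGER PRIMARY KEY, p_category TEXT, p_brand TEXT);
CREATE TABLE supplier (s_suppkey INTEGER PRIMARY KEY, s_region TEXT);
CREATE TABLE lineorder (
    lo_orderdate INTEGER, lo_partkey INTEGER,
    lo_suppkey INTEGER, lo_revenue INTEGER
);
INSERT INTO dim_date VALUES (19930101, 1993), (19940101, 1994);
INSERT INTO part VALUES (1, 'MFGR#12', 'MFGR#121'), (2, 'MFGR#13', 'MFGR#131');
INSERT INTO supplier VALUES (1, 'AMERICA'), (2, 'ASIA');
INSERT INTO lineorder VALUES
    (19930101, 1, 1, 100),
    (19930101, 1, 1, 50),
    (19940101, 1, 1, 70),
    (19940101, 2, 1, 999),  -- excluded: part is in another category
    (19930101, 1, 2, 999);  -- excluded: supplier is in another region
""")

# Three joins (date, part, supplier), two group-bys, two filters.
PRODUCT_INSIGHT = """
SELECT d_year, p_brand, SUM(lo_revenue) AS revenue
FROM lineorder
JOIN dim_date ON lo_orderdate = d_datekey
JOIN part     ON lo_partkey   = p_partkey
JOIN supplier ON lo_suppkey   = s_suppkey
WHERE p_category = 'MFGR#12'
  AND s_region = 'AMERICA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand
"""

rows = conn.execute(PRODUCT_INSIGHT).fetchall()
for row in rows:
    print(row)
```

Narrowing the filter from `p_category` to a single `p_brand` value, as in Q2.3, keeps the same joins and group-bys but makes the query more selective.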

About AtScale



AtScale Intelligence Platform

I.T. needs Control & Consistency

The Business needs Freedom & Self-Service

The Business Interface for Hadoop


Superior Architecture

• Any BI tool

• Industry standards

• Schema on demand

• Write once