
H18207.1

Deployment and Configuration

Dell EMC ECS: Interactive SQL Query Engine with Presto

Building on-premise SQL ecosystem with Dell EMC ECS powered by Presto

Abstract

This paper describes official support and a technical solution for building a distributed, interactive SQL query engine with the Dell EMC ECS object store, powered by Presto. It is intended to empower enterprises to begin using Interactive SQL, Machine Learning, Deep Learning and Artificial Intelligence with a path toward future growth.

March 2020


Revisions

Date Description

March 2020 Initial release

March 2020 Updated to Dell Technologies template, removed S3 Select scope, added TPC-DS example.

Acknowledgements

This paper was produced by the following:

Author: Kirankumar Bhusanurmath, Analytics Solutions Architect.

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [3/26/2020] [Deployment and Configuration] [H18207.1]


Table of contents

Revisions
Acknowledgements
Table of contents
Executive summary
1 Solution overview
1.1 Dell EMC ECS
1.2 Presto
1.2.1 Use cases
1.2.2 Presto Hive(S3) connector
2 Solution architecture
3 Solution implementation
3.1 Dell EMC ECS setup
3.1.1 Create new namespace
3.1.2 Create users and update password
3.1.3 Create buckets
3.2 Configuring Presto
3.3 Configuring Hive Metastore as a Presto catalog
4 Solution validation
4.1 Solution configuration verification
4.1.1 Hive Datawarehouse setup (validation purpose only)
4.1.2 Presto server setup
4.1.3 Presto interactive SQL query engines validation
5 Conclusion
A Technical support and resources
A.1 Related resources


Executive summary

One of the key challenges in any digitization journey is the adoption of new analytics techniques. Given the explosion of tools and frameworks, it can be difficult to know where to start and which choices preclude other choices down the road. The enterprise wants to co-optimize for scalability, maintainability, security and cost. This document is designed to empower enterprises to begin using Interactive SQL, Machine Learning, Deep Learning and/or Artificial Intelligence with a path toward future growth. The paper walks through some code development, some integration work, and some configuration to build the on-premise interactive SQL engine on Dell EMC ECS powered by Presto.

The overall takeaway for enterprises is that Presto’s capabilities translate remarkably well onto Dell EMC ECS object storage. The result is that Dell EMC ECS can and should be a foundational component of the enterprise data stack. Its superior economics, infinite scalability and rich enterprise feature set mean that it wins the price/performance battle against Hadoop and other technologies in most cases.


1 Solution overview

1.1 Dell EMC ECS

Dell Technologies offers the ECS appliance for cloud object storage. The appliance is available either as a turnkey, fully integrated hardware appliance or as a software appliance that can run on commodity or third-party hardware. Administrators can customize and configure the hardware appliance depending on business needs and expected growth. The software appliance is available free of charge for non-production environments.

To access object storage on the ECS system, Dell EMC has built in support for various APIs, such as:

• Amazon S3

• OpenStack Swift

• Dell EMC Atmos

• Dell EMC Centera CAS

Along with the array of supported APIs used to interact with object storage, ECS also supports storage protocols such as HDFS and NFS. Providing multiple protocols can help consolidate hardware resources while keeping support for legacy applications that might need NFS.

As Dell EMC ECS is available in multiple configurations depending on the business need and necessary capacity, the cost can also be structured based on the available budget. Dell EMC Financial Services can provide flexible options for the procurement of Dell EMC hardware, including financing. Unlike many public cloud solutions, Dell EMC ECS doesn’t charge monthly per GB, HTTP GET request, or data egress, helping to limit ‘hidden’ or unexpected costs.

Administrators can deploy the ECS solution anywhere in the world, as the only requirement is space in the datacenter. Businesses can deploy the ECS appliance directly, or service providers can deploy it to provide object storage to their clients. When businesses deploy ECS on their premises, they have absolute control over where data resides.

1.2 Presto

Presto is a fast, interactive, open source distributed SQL query engine for running analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was developed at Facebook in 2012 and open-sourced under the Apache License 2.0 in 2013. Starburst, a company formed by leading contributors to the project, offers enterprise support for Presto.

Presto can connect to various data sources through its connectors. This paper takes a special look at the Hive connector, which lets Presto talk to a Dell EMC ECS server. With Presto querying data from ECS in your private cloud, you have a secure and efficient data storage and processing pipeline.

1.2.1 Use cases

With Presto running queries on a Dell EMC ECS server, multiple query users consume the same data from ECS while applications producing data write to ECS. This effectively separates the compute and storage layers, so each can be scaled independently.


1.2.2 Presto Hive(S3) connector

1.2.2.1 Overview

The Hive connector allows Presto to query data stored in S3-compatible engines and registered in a Hive Metastore. The Hive connector does not use Hive to parse or execute the SQL query in any way. It only uses the Hive Metastore (also known as HCatalog) for metadata behind the scenes. That allows users to share the data with other engines such as Hive, Spark, and Pig.

The Hive connector can be used to provide ANSI SQL analytics of data stored in S3-compatible engines alone, or to join data between different systems such as S3, MySQL, and Cassandra. In addition to SELECT queries, the Hive/S3 connector supports DDL and DML statements such as CREATE SCHEMA, DROP SCHEMA, CREATE TABLE, DROP TABLE, and INSERT INTO. Finally, users can display schemas, tables, views, and columns from the Hive Metastore registered in Presto.

1.2.2.2 Pushdown

Presto can query several popular file formats such as ORC, Parquet, RCFile, Avro, SequenceFile, and Text. The ability to push down column projections and filters depends on the particular file format. For example, in the case of ORC and Parquet, Presto reads only the columns needed by the query and leverages the built-in min/max indices to skip reading files and blocks that do not contain rows with values that would satisfy the filters. These pushdown capabilities improve overall query performance by reducing network I/O between Presto clusters and S3.
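As a rough illustration of a query shape that can benefit from pushdown, the sketch below writes a query modeled on TPC-DS query 3 to a file. It assumes the ORC tables that section 4.1.3 creates in the hive_metastore.tpcds schema; the file name pushdown_example.sql is a placeholder and not part of the original paper. The query references only a few columns per table and filters on d_moy and i_manufact_id, so with ORC the date_dim and item scans would be expected to read only those columns and to skip blocks whose min/max statistics rule out the filter values.

```shell
# Sketch only: the output file name is an illustrative assumption.
set -eu

# Write a query modeled on TPC-DS query 3. It projects a few columns from
# three tables and applies equality filters on two dimension columns.
cat > pushdown_example.sql <<'SQL'
SELECT dt.d_year,
       item.i_brand_id AS brand_id,
       item.i_brand AS brand,
       sum(ss.ss_ext_sales_price) AS sum_agg
FROM hive_metastore.tpcds.date_dim dt
JOIN hive_metastore.tpcds.store_sales ss ON dt.d_date_sk = ss.ss_sold_date_sk
JOIN hive_metastore.tpcds.item item ON ss.ss_item_sk = item.i_item_sk
WHERE item.i_manufact_id = 128
  AND dt.d_moy = 11
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;
SQL

echo "wrote pushdown_example.sql"
```

Where the Presto CLI is installed, the file can be run with a command along the lines of `presto --server localhost:8080 --file pushdown_example.sql`; the server address is an assumption.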

1.2.2.3 Parallelism

Data transfer between a Presto cluster and S3-compatible engines is fully parallelized. Presto splits are created out of data file segments and scheduled across multiple Presto Workers. All Presto Workers open parallel connections to S3-compatible engines and fetch the columns and rows relevant to a given query from their assigned splits.

1.2.2.4 Concurrency

Multiple concurrent Presto queries may reference S3-based tables. Each query parallelizes its data transfer as explained above.

1.2.2.5 Use cases

The key use case for the Hive/S3 connector is to enable fully featured ANSI SQL analytics of data stored in an S3-compatible object store.

In addition, historical data from an S3-compatible object store can be combined in a single query with online data from Cassandra and with data from a relational DBMS such as MySQL.
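A combined query of this kind might look like the sketch below. It assumes that catalogs named mysql and cassandra have been configured in addition to hive_metastore; those catalog names, the crm.customer_profile and events.page_views tables, and their columns are hypothetical and do not come from this paper. Only the store_sales table and its ss_customer_sk and ss_net_paid columns follow the TPC-DS schema used in section 4.1.3.

```shell
# Sketch only: mysql.* and cassandra.* objects are hypothetical examples.
set -eu

cat > federated_example.sql <<'SQL'
-- Customers with recent online activity, read from Cassandra.
WITH recent AS (
    SELECT DISTINCT customer_id
    FROM cassandra.events.page_views
)
-- Historical sales on ECS joined to customer master data in MySQL,
-- restricted to the recently active customers.
SELECT p.segment,
       sum(s.ss_net_paid) AS historical_revenue
FROM hive_metastore.tpcds.store_sales s
JOIN mysql.crm.customer_profile p ON s.ss_customer_sk = p.customer_id
WHERE p.customer_id IN (SELECT customer_id FROM recent)
GROUP BY p.segment
ORDER BY historical_revenue DESC;
SQL

echo "wrote federated_example.sql"
```

The three-part names (catalog.schema.table) are what let one statement span several connectors.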


2 Solution architecture

Figure 1 shows the solution design for implementing an on-premise interactive SQL query engine on the Dell EMC ECS object store.

Figure 1. Dell EMC ECS with Presto solution design


3 Solution implementation

The solution implementation is divided into the following subsections:

1. Dell EMC ECS setup

2. Configuring Presto

3. Configuring Hive Metastore as Presto Catalog

4. Solution validation

3.1 Dell EMC ECS setup

Assuming the ECS cluster is set up with storage pools, a virtual data center, and replication groups configured, we will create the buckets and the access required for the Presto cluster.

3.1.1 Create new namespace

Log in to the ECS cluster and, under Manage, click Namespace to create a new namespace. Fill in the name and replication group as shown below.

ECS Create Namespace

3.1.2 Create users and update password

Now create a new user and generate a secret key as shown below.


ECS Create user and secret key

3.1.3 Create buckets

Under bucket management, create a new bucket in the presto namespace created above and assign the newly created presto user as the owner, as shown below.


ECS Create bucket

3.2 Configuring Presto

Presto installation steps are explained on the Presto documentation page. Follow those steps and create the relevant configuration files.

Create an etc directory inside the installation directory. It will hold the following configuration:

1. Node Properties: environmental configuration specific to each node

2. JVM Config: command line options for the Java Virtual Machine

3. Config Properties: configuration for the Presto server

4. Catalog Properties: configuration for Connectors (data sources)


Presto configuration file
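For orientation, a minimal single-node layout of the first three files could be created as sketched below. The property names follow the Presto deployment documentation, but every value here (installation path, environment name, node ID, data directory, heap size, memory limits, port, discovery URI) is an assumption for illustration and not the configuration used in this paper.

```shell
# Sketch only: all paths and values are illustrative assumptions.
set -eu

PRESTO_HOME="${PRESTO_HOME:-$PWD/presto-server}"
mkdir -p "$PRESTO_HOME/etc/catalog"

# Node properties: identity and data directory of this node.
cat > "$PRESTO_HOME/etc/node.properties" <<'EOF'
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
EOF

# JVM config: one command line option per line.
cat > "$PRESTO_HOME/etc/jvm.config" <<'EOF'
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExitOnOutOfMemoryError
EOF

# Config properties: a single node acting as coordinator and worker.
cat > "$PRESTO_HOME/etc/config.properties" <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF

ls "$PRESTO_HOME/etc"
```

In a multi-node cluster, each node would need its own node.id, and worker nodes would set coordinator=false and point discovery.uri at the coordinator.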

The following section covers the etc/catalog configuration, which defines the data source connectors available to the Presto cluster.

3.3 Configuring Hive Metastore as a Presto catalog

So far the catalog directory has been left empty. This is where Presto looks for its connector configurations. We can configure a Hive connector that connects to ECS. To use the Hive Metastore, Hive must be running and the Metastore service must be started. This adds the overhead of maintaining a Hadoop cluster and an RDBMS outside the ECS object store.

Create etc/catalog/hive_metastore.properties with the following contents to mount the hive-hadoop2 connector as the hive_metastore catalog.


Presto hive connector for hive RDBMS metastore
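Because the figure is not reproduced here, the following is a hedged sketch of what such a catalog file might contain. The property names are those of the Presto Hive connector and its S3 settings; the metastore host, the ECS endpoint and port, and the key values are placeholders that must be replaced with the values from your own environment. The sketch assumes the object user created in section 3.1.2 is named presto and that ECS is reached over plain HTTP with path-style bucket addressing.

```shell
# Sketch only: endpoints and credentials below are placeholder assumptions.
set -eu

PRESTO_HOME="${PRESTO_HOME:-$PWD/presto-server}"
METASTORE_URI="${METASTORE_URI:-thrift://metastore.example.com:9083}"
ECS_ENDPOINT="${ECS_ENDPOINT:-http://ecs.example.com:9020}"
ECS_ACCESS_KEY="${ECS_ACCESS_KEY:-presto}"
ECS_SECRET_KEY="${ECS_SECRET_KEY:-REPLACE_WITH_SECRET_KEY}"

mkdir -p "$PRESTO_HOME/etc/catalog"

# The file name (without .properties) becomes the catalog name in Presto.
cat > "$PRESTO_HOME/etc/catalog/hive_metastore.properties" <<EOF
connector.name=hive-hadoop2
hive.metastore.uri=$METASTORE_URI
hive.s3.endpoint=$ECS_ENDPOINT
hive.s3.aws-access-key=$ECS_ACCESS_KEY
hive.s3.aws-secret-key=$ECS_SECRET_KEY
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
EOF

# The file holds a secret key, so restrict it to the owning user.
chmod 600 "$PRESTO_HOME/etc/catalog/hive_metastore.properties"
```

If the ECS endpoint is served over HTTPS instead, hive.s3.ssl.enabled would be set to true and the endpoint URL adjusted accordingly.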


4 Solution validation

Validation is an important part of this solution because it confirms Presto support with Dell EMC ECS for building an on-premise interactive SQL query engine. The following process validates the installation, configuration, and functionality.

4.1 Solution configuration verification

4.1.1 Hive Datawarehouse setup (validation purpose only)

This step assumes a Hadoop cluster with Hive set up, with table data residing on ECS buckets and accessed through the S3A protocol. Verify the existing Hive data warehouse from the Beeline client, and verify from the ‘hive_ecs_bucket’ database structure that the data is on an ECS bucket.

1. From a Hadoop edge node, connect to the Hive data warehouse using the Beeline client.

2. List the existing databases


3. Describe the “hive_ecs_bucket” database schema to show the location of the data is on ECS storage.

4. List the tables under the hive_ecs_bucket database.
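The four steps above can be sketched as a small script. The JDBC URL is a placeholder, and the script only calls Beeline when it is found on the PATH; on a secured cluster additional connection options (for example a user name) would likely be needed.

```shell
# Sketch only: the HiveServer2 address is a placeholder assumption.
set -eu

HIVE_JDBC_URL="${HIVE_JDBC_URL:-jdbc:hive2://hiveserver.example.com:10000/default}"

# Steps 2 to 4: list databases, show where hive_ecs_bucket is located,
# and list its tables.
cat > verify_hive.sql <<'SQL'
SHOW DATABASES;
DESCRIBE DATABASE hive_ecs_bucket;
SHOW TABLES IN hive_ecs_bucket;
SQL

# Step 1: connect with Beeline and run the statements, if Beeline is present.
if command -v beeline >/dev/null 2>&1; then
  beeline -u "$HIVE_JDBC_URL" -f verify_hive.sql
else
  echo "beeline not found; run verify_hive.sql from a Hadoop edge node" >&2
fi
```

The DESCRIBE DATABASE output includes the database location, which is where an s3a:// path on the ECS bucket would be expected to appear.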

4.1.2 Presto server setup

Verify that the Presto Server starts and that the Hive Metastore connector configuration is loaded as expected.

1. Start the Presto Server.


2. Verify Hive Metastore connector properties are loaded.

3. Verify that the Presto client can connect to the Presto Server and can view the catalog ‘hive_metastore’.

a. hive_metastore fetches the schema from the RDBMS and the data from the ECS bucket through the s3a protocol.

4. From the Presto client, view all schemas and tables under the ‘hive_metastore’ catalog. These must be the same as those seen from the Hive Beeline client.


5. Run a sample query to verify that data stored in the Hive data warehouse can be accessed through the Presto Server.
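Steps 3 to 5 can be sketched with the Presto CLI as shown below. The server address is a placeholder, and SAMPLE_TABLE is a hypothetical table name standing in for one of the tables listed from Beeline in section 4.1.1; the script assumes the CLI is installed under the name presto and skips execution when it is not.

```shell
# Sketch only: server address and table name are placeholder assumptions.
set -eu

PRESTO_SERVER="${PRESTO_SERVER:-localhost:8080}"
SAMPLE_TABLE="${SAMPLE_TABLE:-sample_table}"   # hypothetical table name

# Step 3: the hive_metastore catalog should be listed.
# Step 4: schemas and tables should match what Beeline reported.
# Step 5: a simple query confirms that data can be read from ECS.
cat > verify_presto.sql <<SQL
SHOW CATALOGS;
SHOW SCHEMAS FROM hive_metastore;
SHOW TABLES FROM hive_metastore.hive_ecs_bucket;
SELECT count(*) FROM hive_metastore.hive_ecs_bucket.$SAMPLE_TABLE;
SQL

if command -v presto >/dev/null 2>&1; then
  presto --server "$PRESTO_SERVER" --file verify_presto.sql
else
  echo "presto CLI not found; wrote verify_presto.sql" >&2
fi
```

For step 1, the server is typically started from the installation directory with `bin/launcher start` (as a daemon) or `bin/launcher run` (in the foreground).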

4.1.3 Presto interactive SQL query engines validation

Create a new schema “tpcds” on the hive connector and verify it on the ECS bucket from an S3 browser client.

1. To confirm the SQL query engine, let us run some sample queries from the TPC-DS benchmark toolkit. For this, create a new bucket named “presto_ecs_bucket” on ECS.


2. From the Presto client, create a new schema called “tpcds” in the “hive_metastore” catalog for the new bucket “presto_ecs_bucket”.

3. Presto comes with a tpcds connector. Enable it and restart the Presto server.

4. From the Presto client, use CTAS (CREATE TABLE AS SELECT) to create all TPC-DS benchmark tables, with data, in the new schema ‘tpcds’, which points to ‘presto_ecs_bucket’ on ECS storage.

CREATE TABLE hive_metastore.tpcds.call_center WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.call_center;

CREATE TABLE hive_metastore.tpcds.catalog_page WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.catalog_page;

CREATE TABLE hive_metastore.tpcds.catalog_returns WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.catalog_returns;

CREATE TABLE hive_metastore.tpcds.catalog_sales WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.catalog_sales;

CREATE TABLE hive_metastore.tpcds.customer WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.customer;

CREATE TABLE hive_metastore.tpcds.customer_address WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.customer_address;

CREATE TABLE hive_metastore.tpcds.customer_demographics WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.customer_demographics;

CREATE TABLE hive_metastore.tpcds.date_dim WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.date_dim;

CREATE TABLE hive_metastore.tpcds.household_demographics WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.household_demographics;

CREATE TABLE hive_metastore.tpcds.income_band WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.income_band;

CREATE TABLE hive_metastore.tpcds.inventory WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.inventory;

CREATE TABLE hive_metastore.tpcds.item WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.item;

CREATE TABLE hive_metastore.tpcds.promotion WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.promotion;

CREATE TABLE hive_metastore.tpcds.reason WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.reason;

CREATE TABLE hive_metastore.tpcds.ship_mode WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.ship_mode;

CREATE TABLE hive_metastore.tpcds.store WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.store;

CREATE TABLE hive_metastore.tpcds.store_returns WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.store_returns;

CREATE TABLE hive_metastore.tpcds.store_sales WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.store_sales;

CREATE TABLE hive_metastore.tpcds.time_dim WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.time_dim;

CREATE TABLE hive_metastore.tpcds.warehouse WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.warehouse;

CREATE TABLE hive_metastore.tpcds.web_page WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.web_page;

CREATE TABLE hive_metastore.tpcds.web_returns WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.web_returns;

CREATE TABLE hive_metastore.tpcds.web_sales WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.web_sales;

CREATE TABLE hive_metastore.tpcds.web_site WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.web_site;

5. From the Presto client, verify that all 24 tables were created successfully.


6. From the S3 browser client, verify that directories for all 24 tables were created and loaded with data from the tpcds connector after running the CTAS commands.


7. Now verify the interactive SQL capability of Presto by running sample TPC-DS SQL queries.
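Steps 2 to 4 lend themselves to scripting. The sketch below enables the tpcds catalog and generates the CREATE SCHEMA statement and the CTAS statements for the 24 tables listed in step 4. The installation path, the output file name load_tpcds.sql, and the exact schema location under the bucket are assumptions; the paper states only that the schema points to presto_ecs_bucket.

```shell
# Sketch only: paths, file name, and schema location are assumptions.
set -eu

PRESTO_HOME="${PRESTO_HOME:-$PWD/presto-server}"
SCHEMA_LOCATION="${SCHEMA_LOCATION:-s3a://presto_ecs_bucket/}"

# Step 3: enable the built-in tpcds connector as a catalog named tpcds.
# The Presto server must be restarted before the catalog becomes visible.
mkdir -p "$PRESTO_HOME/etc/catalog"
printf 'connector.name=tpcds\n' > "$PRESTO_HOME/etc/catalog/tpcds.properties"

# The 24 TPC-DS tables copied in step 4.
tables="call_center catalog_page catalog_returns catalog_sales customer
customer_address customer_demographics date_dim household_demographics
income_band inventory item promotion reason ship_mode store store_returns
store_sales time_dim warehouse web_page web_returns web_sales web_site"

{
  # Step 2: schema in the hive_metastore catalog, located on the ECS bucket.
  printf "CREATE SCHEMA IF NOT EXISTS hive_metastore.tpcds WITH (location = '%s');\n" "$SCHEMA_LOCATION"
  # Step 4: one CTAS statement per table, reading scale factor 1 data.
  for t in $tables; do
    printf "CREATE TABLE hive_metastore.tpcds.%s WITH (format = 'ORC') AS SELECT * FROM tpcds.sf1.%s;\n" "$t" "$t"
  done
} > load_tpcds.sql

echo "wrote load_tpcds.sql"
```

After the server has been restarted, the generated file could be run with a command along the lines of `presto --server localhost:8080 --file load_tpcds.sql`, and `SHOW TABLES FROM hive_metastore.tpcds` would then cover the check in step 5.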

To summarize the validation steps above: we verified the Presto installation, configuration, and service startup. We connected to the Presto Server and, with the Hive connector, verified that Presto can operate on an existing Hive data warehouse. To test the interactive SQL query engine, we enabled the TPC-DS connector and created all 24 tables in a schema located on an ECS bucket. We then successfully ran SQL queries selected at random from the list of 99 TPC-DS queries.


5 Conclusion

As enterprises move to object-storage-based private clouds that integrate with applications directly, it makes sense for the analysis technology to plug in to the object storage as well. This makes the process efficient and also easy to maintain. This solution guide showed how enterprises can begin Interactive SQL, Machine Learning, Deep Learning and/or Artificial Intelligence workloads on Dell EMC ECS powered by Presto.

The overall takeaway for enterprises is that Presto’s capabilities translate remarkably well onto the Dell EMC ECS object store. The result is that Dell EMC ECS can and should be a foundational component of the enterprise data stack. Its superior economics, infinite scalability and rich enterprise feature set mean that it wins the price/performance battle against Hadoop and other technologies in most cases.


A Technical support and resources

Dell.com/support is focused on meeting customer needs with proven services and support.

Storage technical documents and videos provide expertise that helps to ensure customer success on Dell EMC storage platforms.

A.1 Related resources


• Hortonworks HDP3.1

• Presto Hive(S3) Connector

• ECS Overview and Architecture

• Dell EMC ECS Object Store

• Presto