H18207.1
Deployment and Configuration
Dell EMC ECS: Interactive SQL Query Engine with Presto
Building on-premise SQL ecosystem with Dell EMC ECS powered by Presto
Abstract
This paper describes the official support and technical solution for building a distributed,
interactive SQL query engine on the Dell EMC ECS object store, powered by
Presto. It aims to empower enterprises to begin Interactive SQL, Machine Learning,
Deep Learning and Artificial Intelligence with a path toward future growth.
March 2020
2 Dell EMC ECS: Interactive SQL Query Engine with Presto | H18207.1
Revisions
Date          Description
March 2020    Initial release
March 2020    Updated to Dell Technologies template, removed S3 Select scope, added TPC-DS example
Acknowledgements
This paper was produced by the following:
Author: Kirankumar Bhusanurmath, Analytics Solutions Architect.
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this
publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell
Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [3/26/2020] [Deployment and Configuration] [H18207.1]
Table of contents
Revisions
Acknowledgements
Table of contents
Executive summary
1 Solution overview
1.1 Dell EMC ECS
1.2 Presto
1.2.1 Use cases
1.2.2 Presto Hive(S3) connector
2 Solution architecture
3 Solution implementation
3.1 Dell EMC ECS setup
3.1.1 Create new namespace
3.1.2 Create users and update password
3.1.3 Create buckets
3.2 Configuring Presto
3.3 Configuring Hive Metastore as a Presto catalog
4 Solution validation
4.1 Solution configuration verification
4.1.1 Hive Datawarehouse setup (validation purpose only)
4.1.2 Presto server setup
4.1.3 Presto interactive SQL query engines validation
5 Conclusion
A Technical support and resources
A.1 Related resources
Executive summary
One of the key challenges in any digitization journey is the adoption of new analytics techniques. Given the
explosion of tools and frameworks, it can be difficult to know where to start and which choices preclude other
choices down the road. Enterprises want to co-optimize for scalability, maintainability, security and cost.
This document is designed to empower enterprises to begin using Interactive SQL, Machine Learning, Deep
Learning and/or Artificial Intelligence with a path toward future growth. The paper works through
some code development, some integration work, and some configuration to build an on-premises
interactive SQL engine on Dell EMC ECS powered by Presto.
The overall takeaway for enterprises is that Presto’s capabilities translate remarkably well onto Dell EMC
ECS object storage. The result is that Dell EMC ECS can and should be a foundational component of the
enterprise data stack. Its superior economics, infinite scalability and rich enterprise feature set mean that it
wins the price/performance battle against Hadoop and other technologies in most cases.
1 Solution overview
1.1 Dell EMC ECS
Dell Technologies offers the ECS appliance for cloud object storage. The appliance is available as either a
turnkey, fully integrated hardware appliance, or as a software appliance that can run on commodity or
third-party hardware. Administrators can customize and configure the hardware appliance depending on business
needs and expected growth. The software appliance is available free of charge for non-production
environments.
To access object storage on the ECS system, Dell EMC has built in support for various APIs, such as:
• Amazon S3
• OpenStack Swift
• Dell EMC Atmos
• Dell EMC Centera CAS
Along with the array of supported APIs used to interact with object storage, ECS also provides support for
various storage protocols such as HDFS and NFS. Supporting multiple protocols on one platform can help
consolidate hardware resources while keeping support for legacy applications that might need NFS.
As Dell EMC ECS is available in multiple configurations depending on the business need and necessary
capacity, the cost can also be structured based on the available budget. Dell EMC Financial Services can
provide flexible options for the procurement of Dell EMC hardware, including financing. Unlike many public
cloud solutions, Dell EMC ECS doesn’t charge monthly per GB, HTTP GET request, or data egress, helping
to limit ‘hidden’ or unexpected costs.
Administrators can deploy the ECS solution anywhere in the world, as the only requirement is space in the
datacenter. Businesses can deploy the ECS appliance directly, or service providers can deploy it to provide
object storage to their clients. When businesses deploy ECS on their own premises, they have absolute control
over where data resides.
1.2 Presto
Presto is a fast, interactive, open source distributed SQL query engine for running analytic queries against
data sources of all sizes, ranging from gigabytes to petabytes. Presto was developed at Facebook in 2012 and
open-sourced under the Apache License 2.0 in 2013. Starburst, a company formed by leading contributors to the
project, offers enterprise support for Presto.
Presto can connect to various data sources through its connectors. This paper takes a closer look at the Hive
connector, which lets Presto talk to a Dell EMC ECS server. With Presto querying data from ECS in your
private cloud, you have a secure and efficient data storage and processing pipeline.
1.2.1 Use cases
With Presto running queries on a Dell EMC ECS server, multiple query users can consume the same data from
ECS while the applications producing data write to ECS. This effectively separates the compute and
storage layers and hence provides flexible scaling options.
1.2.2 Presto Hive(S3) connector
1.2.2.1 Overview
The Hive connector allows Presto to query data stored in S3-compatible engines and registered in a Hive
Metastore. The Hive connector does not actually use Hive to parse or execute the SQL query in any
way. It only uses the Hive Metastore (also known as HCatalog) for metadata behind the scenes. That allows users to
share the data with other engines such as Hive, Spark, and Pig.
The Hive connector can be used to provide ANSI SQL analytics of data stored in S3-compatible engines
alone, or to join data between different systems such as S3, MySQL, and Cassandra. In addition to SELECT
queries, DDL and DML statements such as CREATE/DROP SCHEMA, CREATE/DROP TABLE, and INSERT INTO are also
supported by the Hive/S3 connector. Finally, users can display schemas, tables, views, and columns from the
Hive Metastore registered in Presto.
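The statement types listed above can be sketched as follows. This is only an illustrative sequence: the catalog name hive_metastore is the one this paper configures later, while the schema demo, the table events, its columns, and the bucket example-bucket are hypothetical names chosen for the example.

```sql
-- Hypothetical names: schema "demo", table "events", bucket "example-bucket".
-- DDL: create a schema whose data lives in an S3-compatible bucket.
CREATE SCHEMA hive_metastore.demo
WITH (location = 's3a://example-bucket/demo/');

-- DDL: create an ORC table in that schema.
CREATE TABLE hive_metastore.demo.events (
    event_id   bigint,
    event_type varchar,
    event_date date
)
WITH (format = 'ORC');

-- DML: insert a row, then query it back.
INSERT INTO hive_metastore.demo.events
VALUES (1, 'login', DATE '2020-03-01');

SELECT event_type, count(*) AS events
FROM hive_metastore.demo.events
GROUP BY event_type;

-- Metadata: display schemas, tables, and columns known to the Hive Metastore.
SHOW SCHEMAS FROM hive_metastore;
SHOW TABLES FROM hive_metastore.demo;
SHOW COLUMNS FROM hive_metastore.demo.events;

-- DDL: clean up.
DROP TABLE hive_metastore.demo.events;
DROP SCHEMA hive_metastore.demo;
```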
1.2.2.2 Pushdown
Presto can query several popular file formats such as ORC, Parquet, RCFile, AVRO, SequenceFile, and Text.
The ability to push down column projections and filters depends on the particular file format. For example, in
the case of ORC and Parquet, Presto will read only the columns needed by the query and leverage the
built-in min/max indices to skip reading files and blocks that do not contain rows with values that would satisfy the
filters. The pushdown capabilities improve overall query performance by reducing network I/O between Presto
clusters and S3.
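As an illustration of the kind of query that benefits, the sketch below reads two columns of the ORC-formatted store_sales table that the validation section of this paper creates, with a range filter on one of them. The literal filter values are placeholders, and how much data is actually skipped depends on how the rows are laid out in the files.

```sql
-- Only ss_sold_date_sk and ss_net_paid need to be read from the ORC files;
-- the remaining store_sales columns are not fetched from the object store.
-- The range predicate can be checked against the ORC min/max statistics, so
-- blocks whose value range falls entirely outside it may be skipped.
-- The filter values below are placeholders, not taken from this paper.
SELECT ss_sold_date_sk,
       sum(ss_net_paid) AS net_paid
FROM hive_metastore.tpcds.store_sales
WHERE ss_sold_date_sk BETWEEN 2451545 AND 2451575
GROUP BY ss_sold_date_sk
ORDER BY ss_sold_date_sk;
```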
1.2.2.3 Parallelism
Data transfer between a Presto cluster and S3-compatible engines is fully parallelized. Presto splits are
created out of data file segments and scheduled across multiple Presto Workers. All Presto Workers open
parallel connections to S3-compatible engines and fetch the columns and rows relevant to a given query from
the assigned splits.
1.2.2.4 Concurrency
Multiple concurrent Presto queries may reference S3-based tables. Each query parallelizes its data transfer
as explained above.
1.2.2.5 Use cases
The key use case for the Hive/S3 connector is to enable fully-featured ANSI SQL analytics of data stored in
an S3-compatible object store.
In addition, historical data from an S3-compatible object store can be combined in a single query with
online data from Cassandra and with data from a relational DBMS such as MySQL.
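A federated query of this kind might look like the sketch below. It assumes that mysql and cassandra catalogs have been configured alongside hive_metastore; those two catalog names and every schema, table, and column name in the query are hypothetical and serve only to show the shape of such a join.

```sql
-- Hypothetical catalogs, schemas, tables and columns throughout.
-- Historical orders come from the object store, customer master data from
-- MySQL, and recent session activity from Cassandra.
SELECT c.customer_id,
       c.customer_name,
       h.lifetime_orders,
       s.last_seen
FROM mysql.crm.customers AS c
JOIN (
    -- Aggregate the historical data before joining to keep one row per customer.
    SELECT customer_id, count(*) AS lifetime_orders
    FROM hive_metastore.sales.orders_history
    GROUP BY customer_id
) AS h ON h.customer_id = c.customer_id
LEFT JOIN (
    -- Customers without any recorded session still appear, with a NULL last_seen.
    SELECT customer_id, max(seen_at) AS last_seen
    FROM cassandra.web.sessions
    GROUP BY customer_id
) AS s ON s.customer_id = c.customer_id
ORDER BY h.lifetime_orders DESC
LIMIT 20;
```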
2 Solution architecture
Figure 1 shows the solution design for implementing an on-premises interactive SQL query engine on the Dell EMC ECS
object store.
Figure 1. Dell EMC ECS with Presto solution design
3 Solution implementation
The solution implementation is divided into the following subsections:
1. Dell EMC ECS setup
2. Configuring Presto
3. Configuring Hive Metastore as Presto Catalog
4. Solution validation
3.1 Dell EMC ECS setup
Assuming the ECS cluster is already set up with storage pools, a virtual data center and replication groups configured, we
will create the buckets and the access required by the Presto cluster.
3.1.1 Create new namespace
Log in to the ECS cluster and, under Manage, click Namespace to create a new namespace. Fill in the name and
replication group as shown below.
Figure: ECS Create Namespace
3.1.2 Create users and update password
Now create a new user and generate a secret key as shown below.
Figure: ECS Create user and secret key
3.1.3 Create buckets
Under bucket management, create a new bucket in the presto namespace created above and assign the new user presto
as the owner, as shown below.
Figure: ECS Create bucket
3.2 Configuring Presto
Presto installation steps are explained on the Presto documentation page. Follow those steps and create the relevant
config files.
Create an etc directory inside the installation directory. This will hold the following configuration:
1. Node Properties: environmental configuration specific to each node
2. JVM Config: command line options for the Java Virtual Machine
3. Config Properties: configuration for the Presto server
4. Catalog Properties: configuration for Connectors (data sources)
Figure: Presto configuration file
The following section covers the etc/catalog configurations, which define the data source
connectors available to the Presto cluster.
3.3 Configuring Hive Metastore as a Presto catalog
We have left our catalog directory empty so far. That is where Presto expects the connector configurations.
We can configure a Hive connector that connects to ECS. To use the Hive Metastore, Hive needs to be running
and the Metastore service needs to be started. This adds the overhead of maintaining a Hadoop cluster and an RDBMS
outside the ECS object store.
Create etc/catalog/hive_metastore.properties with the following contents to mount
the hive-hadoop2 connector as the hive_metastore catalog.
Figure: Presto Hive connector for Hive RDBMS metastore
4 Solution validation
Validation is an important part of this solution: it demonstrates that Presto works with Dell EMC ECS for building an
on-premises interactive SQL query engine. We will follow the process below to validate the installation and configuration
and to perform functional testing.
4.1 Solution configuration verification
4.1.1 Hive Datawarehouse setup (validation purpose only)
This step assumes a Hadoop cluster with Hive set up and table data residing in ECS buckets accessed through the
S3A protocol. Verify the existing Hive Datawarehouse from the Beeline client and confirm, from the
‘hive_ecs_bucket’ database structure, that the data is on an ECS bucket.
1. From the Hadoop edge node, connect to the Hive Datawarehouse using the Beeline command-line client.
2. List the existing databases.
3. Describe the “hive_ecs_bucket” database to show that its data is located on ECS storage.
4. List the tables under the hive_ecs_bucket database.
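Steps 2 to 4 above can be sketched as the following HiveQL statements, run from the Beeline prompt once the connection in step 1 has been made. The connection string itself is environment-specific and is not shown; the database name is the one used in this paper.

```sql
-- Step 2: list the existing databases.
SHOW DATABASES;

-- Step 3: show the database details, including its storage location,
-- which is expected to point at the ECS bucket.
DESCRIBE DATABASE hive_ecs_bucket;

-- Step 4: list the tables in the database.
SHOW TABLES IN hive_ecs_bucket;
```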
4.1.2 Presto server setup
Verify that the Presto Server starts and that the Hive Metastore connector configuration is loaded as expected.
1. Start the Presto Server.
2. Verify Hive Metastore connector properties are loaded.
3. Verify that the Presto client can connect to the Presto Server and can view the catalog ‘hive_metastore’.
a. hive_metastore fetches the schema from the RDBMS and the data from the ECS bucket through the s3a
protocol.
4. From the Presto client, view all schemas and tables under the ‘hive_metastore’ catalog. These must be the same
as those checked from the Hive Beeline client.
5. Run a sample query to verify that data stored in the Hive Datawarehouse can be accessed through the Presto
Server.
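Steps 3 to 5 above can be sketched as the following statements, issued from the Presto client. The catalog and schema names are the ones used in this paper; sample_table is a hypothetical placeholder for any table that Beeline listed under hive_ecs_bucket.

```sql
-- Step 3: the hive_metastore catalog should appear in the list.
SHOW CATALOGS;

-- Step 4: schemas and tables should match what Beeline reported.
SHOW SCHEMAS FROM hive_metastore;
SHOW TABLES FROM hive_metastore.hive_ecs_bucket;

-- Step 5: read a few rows through Presto.
-- "sample_table" is a placeholder; substitute a table name from the listing above.
SELECT *
FROM hive_metastore.hive_ecs_bucket.sample_table
LIMIT 10;
```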
4.1.3 Presto interactive SQL query engines validation
Create a new schema “tpcds” on the Hive connector and verify it from an S3 browser client on the ECS bucket.
1. To confirm the SQL query engine, let us run some sample queries from the TPC-DS benchmark toolkit. For
this, let us create a new bucket named “presto_ecs_bucket” on ECS.
2. From Presto client create new schema called “tpcds” on “hive_metastore” catalog for the new bucket
“presto_ecs_bucket”.
3. Presto comes with a tpcds connector; let us enable it and restart the Presto server.
4. From the Presto client, use CTAS (Create Table As Select) to create all TPCDS
benchmark tables with data in the new schema ‘tpcds’, which points to ‘presto_ecs_bucket’ on
ECS storage.
CREATE TABLE hive_metastore.tpcds.call_center WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.call_center
CREATE TABLE hive_metastore.tpcds.catalog_page WITH (format = 'ORC') AS SELECT
* FROM tpcds.sf1.catalog_page
CREATE TABLE hive_metastore.tpcds.catalog_returns WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.catalog_returns
CREATE TABLE hive_metastore.tpcds.catalog_sales WITH (format = 'ORC') AS SELECT
* FROM tpcds.sf1.catalog_sales
CREATE TABLE hive_metastore.tpcds.customer WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.customer
CREATE TABLE hive_metastore.tpcds.customer_address WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.customer_address
Solution validation
18 Dell EMC ECS: Interactive SQL Query Engine with Presto | H18207.1
CREATE TABLE hive_metastore.tpcds.customer_demographics WITH (format = 'ORC') AS
SELECT * FROM tpcds.sf1.customer_demographics
CREATE TABLE hive_metastore.tpcds.date_dim WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.date_dim
CREATE TABLE hive_metastore.tpcds.household_demographics WITH (format = 'ORC')
AS SELECT * FROM tpcds.sf1.household_demographics
CREATE TABLE hive_metastore.tpcds.income_band WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.income_band
CREATE TABLE hive_metastore.tpcds.inventory WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.inventory
CREATE TABLE hive_metastore.tpcds.item WITH (format = 'ORC') AS SELECT * FROM
tpcds.sf1.item
CREATE TABLE hive_metastore.tpcds.promotion WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.promotion
CREATE TABLE hive_metastore.tpcds.reason WITH (format = 'ORC') AS SELECT * FROM
tpcds.sf1.reason
CREATE TABLE hive_metastore.tpcds.ship_mode WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.ship_mode
CREATE TABLE hive_metastore.tpcds.store WITH (format = 'ORC') AS SELECT * FROM
tpcds.sf1.store
CREATE TABLE hive_metastore.tpcds.store_returns WITH (format = 'ORC') AS SELECT
* FROM tpcds.sf1.store_returns
CREATE TABLE hive_metastore.tpcds.store_sales WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.store_sales
CREATE TABLE hive_metastore.tpcds.time_dim WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.time_dim
CREATE TABLE hive_metastore.tpcds.warehouse WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.warehouse
CREATE TABLE hive_metastore.tpcds.web_page WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.web_page
CREATE TABLE hive_metastore.tpcds.web_returns WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.web_returns
CREATE TABLE hive_metastore.tpcds.web_sales WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.web_sales
CREATE TABLE hive_metastore.tpcds.web_site WITH (format = 'ORC') AS SELECT *
FROM tpcds.sf1.web_site
5. From the Presto client, verify that all 24 tables listed above were created successfully.
6. From s3browser, verify that the directories for all 24 tables were created and loaded with the necessary data from
the tpcds connector after running the CTAS commands.
7. Now let us verify the interactive SQL engine capability of Presto by running sample TPCDS SQL
queries.
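Steps 2, 5 and 7 above can be sketched in Presto SQL as follows. This is a minimal sketch rather than the exact statements used in the validation: the schema location URI is an assumption about how the bucket is addressed, and the final query is only written in the style of a TPC-DS reporting query, not copied from the benchmark's query list.

```sql
-- Step 2: create the schema on the hive_metastore catalog.
-- The location URI is an assumption; adjust it to how the ECS bucket is addressed.
CREATE SCHEMA hive_metastore.tpcds
WITH (location = 's3a://presto_ecs_bucket/');

-- Step 5: after the CTAS statements have run, list the tables in the new schema.
SHOW TABLES FROM hive_metastore.tpcds;

-- Optional spot check: the generated source table and its copy on ECS
-- should report the same row count.
SELECT
    (SELECT count(*) FROM tpcds.sf1.call_center)            AS source_rows,
    (SELECT count(*) FROM hive_metastore.tpcds.call_center) AS ecs_rows;

-- Step 7: an illustrative interactive query over the tables stored on ECS.
-- It reports store sales per brand and year for one month of the year.
SELECT d.d_year,
       i.i_brand_id,
       i.i_brand,
       sum(ss.ss_ext_sales_price) AS total_sales
FROM hive_metastore.tpcds.store_sales AS ss
JOIN hive_metastore.tpcds.date_dim AS d
    ON ss.ss_sold_date_sk = d.d_date_sk
JOIN hive_metastore.tpcds.item AS i
    ON ss.ss_item_sk = i.i_item_sk
WHERE d.d_moy = 11
GROUP BY d.d_year, i.i_brand_id, i.i_brand
ORDER BY d.d_year, total_sales DESC
LIMIT 10;
```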
To summarize the above validation steps, we verified the Presto installation, configuration and service startup.
We connected to the Presto Server and, with the Hive connector, verified that Presto can operate on an already existing Hive
Datawarehouse. To test the interactive SQL query engine, we enabled the TPCDS connector and created all 24
tables in the schema with its location on the ECS bucket. We then successfully ran SQL queries selected randomly from the
list of 99 TPCDS queries.
5 Conclusion
As enterprises move to object storage based private clouds that integrate with applications directly, it makes
sense for the analysis technology to plug in to the object storage as well. Not only does this make the process
efficient, it is easy to maintain as well. This solution guide showed how enterprises can
begin Interactive SQL, Machine Learning, Deep Learning and/or Artificial Intelligence workloads on Dell EMC
ECS powered by Presto.
The overall takeaway for enterprises is that Presto’s capabilities translate remarkably well onto the Dell EMC
ECS object store. The result is that Dell EMC ECS can and should be a foundational component of the
enterprise data stack. Its superior economics, infinite scalability and rich enterprise feature set mean that it
wins the price/performance battle against Hadoop and other technologies in most cases.
A Technical support and resources
Dell.com/support is focused on meeting customer needs with proven services and support.
Storage technical documents and videos provide expertise that helps to ensure customer success on Dell
EMC storage platforms.
A.1 Related resources
• Hortonworks HDP3.1
• Presto Hive(S3) Connector
• ECS Overview and Architecture
• Dell EMC ECS Object Store
• Presto