
Cloud SQL Shootout: TPC-H on Redshift and Hive
(Source: homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Hivevs.Redshift.pdf)



Cloud SQL Shootout: TPC-H on Redshift and Hive

Today's business analyst demands SQL-like access to Big Data™. Your task today is to design a SQL-in-the-cloud data warehouse system. You will compare Hive and AWS Redshift, a hosted version of the parallel database system Actian Matrix (probably better known under its previous name, ParAccel). As a benchmark we are going to use TPC-H, an industry-standard benchmark for analytical SQL systems: http://www.tpc.org/tpch/ Three data sets with scale factors 1, 10 and 300 have already been created for you and uploaded to S3. TPC-H scale factor 300 means that the largest table, "lineitem", has 1,799,989,091 records.

Preparation for Redshift

Install SQL Workbench and the Redshift JDBC driver: http://www.sql-workbench.net/ Installation instructions: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html The Redshift JDBC driver can be downloaded from http://homepages.cwi.nl/~hannes/RedshiftJDBC41-1.1.7.1007.jar This time, you will need your AWS access credentials. Create a new access key: go to "Security Credentials" in the console.


Note down your access key ID and secret access key somewhere you will find them again. You can also simply download the key file, which contains this information.
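One way to keep the keys out of the SQL you paste around: keep them in environment variables and build the credentials string that the COPY commands later in this lab expect. This is only an illustrative convention (the variable names and the copy_credentials helper are made up here; Redshift does not read them by itself):

```python
# Sketch: assemble the Redshift COPY 'credentials' string from
# environment variables instead of pasting raw keys into every query.
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are conventional names,
# used here only as a place to stash the values from the key file.
import os

def copy_credentials() -> str:
    access_key = os.environ.get("AWS_ACCESS_KEY_ID", "XXXXX")
    secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY", "YYYYY")
    return f"aws_access_key_id={access_key};aws_secret_access_key={secret_key}"

print(copy_credentials())
```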


Redshift Startup

Go to the AWS Redshift console and click the "Launch Cluster" button.


For now, we will create a single-node cluster for testing purposes.


Make sure you note the username and password; you will need them later. The database name can remain empty.


Select a single-node cluster for now.


All additional configuration can remain at the defaults. Then launch the cluster.


Go to the clusters dashboard; you will see your cluster launching.


Wait until the cluster is available.

Click the cluster name ("bads-test1" here) to show its details. It might show a warning because the firewall does not allow access to the JDBC port. Click the warning sign and then "Edit security group".


Edit the security group on the "inbound" tab to allow Redshift connections from anywhere:

Afterwards, the warning should be gone.


If connecting with SQL Workbench (see below) does not work, either after the above changes or even though you never got the "No Inbound Permissions" warning in the first place, please try the following in the EC2 console to open up access. (Thanks to Taco Wijnsma for the hint!)


Then, use the displayed JDBC URL to connect using SQL Workbench.


Redshift Schema / Data Loading

Run the following script in SQL Workbench to create and load the tables (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/RedshiftSchemaDataLoading.sql).

CREATE TABLE region (
  r_regionkey INT NOT NULL, r_name VARCHAR(25) NOT NULL,
  r_comment VARCHAR(152) NOT NULL,
  PRIMARY KEY (r_regionkey));

CREATE TABLE nation (
  n_nationkey INT NOT NULL, n_name VARCHAR(25) NOT NULL,
  n_regionkey INT NOT NULL, n_comment VARCHAR(152) NOT NULL,
  PRIMARY KEY (n_nationkey));

CREATE TABLE supplier (
  s_suppkey INT NOT NULL, s_name VARCHAR(25) NOT NULL,
  s_address VARCHAR(40) NOT NULL, s_nationkey INT NOT NULL,
  s_phone VARCHAR(15) NOT NULL, s_acctbal DECIMAL(15,2) NOT NULL,
  s_comment VARCHAR(101) NOT NULL,
  PRIMARY KEY (s_suppkey));

CREATE TABLE customer (
  c_custkey INT NOT NULL, c_name VARCHAR(25) NOT NULL,
  c_address VARCHAR(40) NOT NULL, c_nationkey INT NOT NULL,
  c_phone VARCHAR(15) NOT NULL, c_acctbal DECIMAL(15,2) NOT NULL,
  c_mktsegment VARCHAR(10) NOT NULL, c_comment VARCHAR(117) NOT NULL,
  PRIMARY KEY (c_custkey));

CREATE TABLE part (
  p_partkey INT NOT NULL, p_name VARCHAR(55) NOT NULL,
  p_mfgr VARCHAR(25) NOT NULL, p_brand VARCHAR(10) NOT NULL,
  p_type VARCHAR(25) NOT NULL, p_size INT NOT NULL,
  p_container VARCHAR(10) NOT NULL, p_retailprice DECIMAL(15,2) NOT NULL,
  p_comment VARCHAR(23) NOT NULL,
  PRIMARY KEY (p_partkey));

CREATE TABLE partsupp (
  ps_partkey INT NOT NULL, ps_suppkey INT NOT NULL,
  ps_availqty INT NOT NULL, ps_supplycost DECIMAL(15,2) NOT NULL,
  ps_comment VARCHAR(199) NOT NULL,
  PRIMARY KEY (ps_partkey, ps_suppkey),
  FOREIGN KEY (ps_partkey) REFERENCES part (p_partkey),
  FOREIGN KEY (ps_suppkey) REFERENCES supplier (s_suppkey));

CREATE TABLE orders (
  o_orderkey INT NOT NULL, o_custkey INT NOT NULL,
  o_orderstatus VARCHAR(1) NOT NULL, o_totalprice DECIMAL(15,2) NOT NULL,
  o_orderdate DATE NOT NULL, o_orderpriority VARCHAR(15) NOT NULL,
  o_clerk VARCHAR(15) NOT NULL, o_shippriority INT NOT NULL,
  o_comment VARCHAR(79) NOT NULL,
  PRIMARY KEY (o_orderkey));

CREATE TABLE lineitem (
  l_orderkey INT NOT NULL, l_partkey INT NOT NULL,
  l_suppkey INT NOT NULL, l_linenumber INT NOT NULL,
  l_quantity INTEGER NOT NULL, l_extendedprice DECIMAL(15,2) NOT NULL,
  l_discount DECIMAL(15,2) NOT NULL, l_tax DECIMAL(15,2) NOT NULL,
  l_returnflag VARCHAR(1) NOT NULL, l_linestatus VARCHAR(1) NOT NULL,
  l_shipdate DATE NOT NULL, l_commitdate DATE NOT NULL,
  l_receiptdate DATE NOT NULL, l_shipinstruct VARCHAR(25) NOT NULL,
  l_shipmode VARCHAR(10) NOT NULL, l_comment VARCHAR(44) NOT NULL,
  PRIMARY KEY (l_orderkey, l_linenumber));

COMMIT;
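Before pasting DDL into a remote cluster, a quick local syntax check can catch typos. As an illustration only, SQLite happens to accept the same INT/VARCHAR/DECIMAL type names (mapping them to its type affinities), so the first two statements above can be exercised with Python's built-in sqlite3:

```python
# Minimal local sanity check of the region/nation DDL using sqlite3.
# SQLite is not Redshift; this only verifies the statements parse and
# accept a row, which is enough to catch copy/paste damage.
import sqlite3

DDL = """
CREATE TABLE region (r_regionkey INT NOT NULL, r_name VARCHAR(25) NOT NULL,
    r_comment VARCHAR(152) NOT NULL, PRIMARY KEY (r_regionkey));
CREATE TABLE nation (n_nationkey INT NOT NULL, n_name VARCHAR(25) NOT NULL,
    n_regionkey INT NOT NULL, n_comment VARCHAR(152) NOT NULL,
    PRIMARY KEY (n_nationkey));
"""

con = sqlite3.connect(":memory:")
con.executescript(DDL)
con.execute("INSERT INTO region VALUES (0, 'AFRICA', 'n/a')")
count = con.execute("SELECT count(*) FROM region").fetchone()[0]
print(count)  # → 1
```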


-- In the remainder, replace XXXXX / YYYYY with your access key / secret access key!

copy region from 's3://tpch-bads-data/sf1/region/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy nation from 's3://tpch-bads-data/sf1/nation/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy customer from 's3://tpch-bads-data/sf1/customer/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy orders from 's3://tpch-bads-data/sf1/orders/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy lineitem from 's3://tpch-bads-data/sf1/lineitem/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy part from 's3://tpch-bads-data/sf1/part/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy partsupp from 's3://tpch-bads-data/sf1/partsupp/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy supplier from 's3://tpch-bads-data/sf1/supplier/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
COMMIT;

Replace XXXXX / YYYYY with your access key / secret access key! You can observe your cluster working by going to the "Queries" tab in the cluster details on the Redshift console.
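Since the eight COPY statements differ only in the table name, you could also generate them instead of editing the keys into each one by hand. A sketch (the copy_statements helper is hypothetical; the bucket prefix, delimiter and placeholder keys are taken from the statements above):

```python
# Generate the eight COPY statements for the SF1 data set, so the
# access keys are substituted in exactly one place.
TABLES = ["region", "nation", "customer", "orders",
          "lineitem", "part", "partsupp", "supplier"]

def copy_statements(bucket_prefix: str, access_key: str, secret_key: str):
    creds = f"aws_access_key_id={access_key};aws_secret_access_key={secret_key}"
    return [
        f"copy {t} from '{bucket_prefix}/{t}/' delimiter '|' gzip "
        f"credentials '{creds}';"
        for t in TABLES
    ]

for stmt in copy_statements("s3://tpch-bads-data/sf1", "XXXXX", "YYYYY"):
    print(stmt)
```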


Make sure the data is loaded by running a SELECT COUNT(*) for all the loaded tables after the COMMIT (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/SELECT_COUNT.sql):

SELECT count(*) from region;
SELECT count(*) from nation;
SELECT count(*) from supplier;
SELECT count(*) from customer;
SELECT count(*) from part;
SELECT count(*) from partsupp;
SELECT count(*) from orders;
SELECT count(*) from lineitem;

Run the following queries and note their runtime.
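For reference when checking the counts: the TPC-H specification fixes the base-table cardinalities at scale factor 1 (lineitem's exact count is what the dbgen data generator emits). A small hypothetical checker, with the expected numbers inlined:

```python
# Expected TPC-H SF1 base-table cardinalities, for comparing against
# the SELECT COUNT(*) results above.
EXPECTED_SF1 = {
    "region": 5, "nation": 25, "supplier": 10_000, "customer": 150_000,
    "part": 200_000, "partsupp": 800_000, "orders": 1_500_000,
    "lineitem": 6_001_215,
}

def check_counts(observed: dict) -> list:
    """Return the names of tables whose observed count is off."""
    return [t for t, n in EXPECTED_SF1.items() if observed.get(t) != n]

# An empty list means every table loaded completely.
print(check_counts(dict(EXPECTED_SF1)))  # → []
```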

Redshift TPC-H Query 1 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Redshift-TPCH-Query1.sql)

select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '108' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
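To see what Query 1's aggregates compute, here is a toy run on three fabricated lineitem rows using Python's built-in sqlite3 (SQLite has no INTERVAL syntax, so the cutoff '1998-08-15', i.e. 1998-12-01 minus 108 days, is precomputed; the select list is trimmed to a few of the aggregates):

```python
# Toy Query 1: group surviving lineitem rows by (returnflag, linestatus)
# and aggregate quantities and prices. The third row ships after the
# cutoff date and is filtered out.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE lineitem (
    l_quantity INT, l_extendedprice REAL, l_discount REAL, l_tax REAL,
    l_returnflag TEXT, l_linestatus TEXT, l_shipdate TEXT)""")
con.executemany("INSERT INTO lineitem VALUES (?,?,?,?,?,?,?)", [
    (10, 100.0, 0.25, 0.05, 'N', 'O', '1998-01-01'),
    (20, 200.0, 0.00, 0.00, 'N', 'O', '1998-02-01'),
    ( 5,  50.0, 0.50, 0.10, 'R', 'F', '1998-12-31'),  # after cutoff
])
rows = con.execute("""
    select l_returnflag, l_linestatus,
           sum(l_quantity), sum(l_extendedprice),
           sum(l_extendedprice * (1 - l_discount)),
           count(*)
    from lineitem
    where l_shipdate <= '1998-08-15'
    group by l_returnflag, l_linestatus
    order by l_returnflag, l_linestatus""").fetchall()
print(rows)  # → [('N', 'O', 30, 300.0, 275.0, 2)]
```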

Redshift TPC-H Query 5 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Redshift-TPCH-Query5.sql)

select
  n_name,
  sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'MIDDLE EAST'
group by n_name
order by revenue desc;


Redshift Shutting Down

Once you are done with your queries, shut down your cluster.


Hive Schema / Data Loading

Start up an EMR cluster (all defaults, 2 nodes; remember the Hue security group).

Access Hue's Hive query editor


Set S3 Credentials for Hive

Set the first key to fs.s3n.awsAccessKeyId and its value to your S3 access key; set the second key to fs.s3n.awsSecretAccessKey and its value to your S3 secret key. Create and load the tables (not really loading: these external tables just point at the data in S3) (a plain text file is available at


http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/HivesSchemaDataLoading.sql)

CREATE EXTERNAL TABLE customer(
  C_CustKey int, C_Name varchar(64), C_Address varchar(64),
  C_NationKey int, C_Phone varchar(64), C_AcctBal decimal(13, 2),
  C_MktSegment varchar(64), C_Comment varchar(120), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/customer/';

CREATE EXTERNAL TABLE lineitem(
  L_OrderKey int, L_PartKey int, L_SuppKey int, L_LineNumber int,
  L_Quantity int, L_ExtendedPrice decimal(13, 2), L_Discount decimal(13, 2),
  L_Tax decimal(13, 2), L_ReturnFlag varchar(64), L_LineStatus varchar(64),
  L_ShipDate date, L_CommitDate date, L_ReceiptDate date,
  L_ShipInstruct varchar(64), L_ShipMode varchar(64), L_Comment varchar(64),
  skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/lineitem/';

CREATE EXTERNAL TABLE nation(
  N_NationKey int, N_Name varchar(64), N_RegionKey int,
  N_Comment varchar(160), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/nation/';

CREATE EXTERNAL TABLE orders(
  O_OrderKey int, O_CustKey int,


  O_OrderStatus varchar(64), O_TotalPrice decimal(13, 2), O_OrderDate date,
  O_OrderPriority varchar(15), O_Clerk varchar(64), O_ShipPriority int,
  O_Comment varchar(80), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/orders/';

CREATE EXTERNAL TABLE part(
  P_PartKey int, P_Name varchar(64), P_Mfgr varchar(64), P_Brand varchar(64),
  P_Type varchar(64), P_Size int, P_Container varchar(64),
  P_RetailPrice decimal(13, 2), P_Comment varchar(64), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/part/';

CREATE EXTERNAL TABLE partsupp(
  PS_PartKey int, PS_SuppKey int, PS_AvailQty int,
  PS_SupplyCost decimal(13, 2), PS_Comment varchar(200), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/partsupp/';

CREATE EXTERNAL TABLE region(
  R_RegionKey int, R_Name varchar(64), R_Comment varchar(160),
  skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/region/';

CREATE EXTERNAL TABLE supplier(
  S_SuppKey int, S_Name varchar(64), S_Address varchar(64),
  S_NationKey int, S_Phone varchar(18), S_AcctBal decimal(13, 2),


  S_Comment varchar(105), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/supplier/';

Again, you might want to make sure the data is accessible by running a SELECT COUNT(*) for all the tables (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/SELECT_COUNT.sql):

SELECT count(*) from region;
SELECT count(*) from nation;
SELECT count(*) from supplier;
SELECT count(*) from customer;
SELECT count(*) from part;
SELECT count(*) from partsupp;
SELECT count(*) from orders;
SELECT count(*) from lineitem;

Please note whether, and if so how, the execution times for these SELECT COUNT(*) queries differ between Redshift and Hive. If they do, any idea why?
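The trailing "skip varchar(64)" column in every table above deals with dbgen's record format: each line of the generated .tbl files ends with a trailing '|', so a delimited reader produces one extra (empty) field per record. A two-line illustration:

```python
# Why the dummy 'skip' column exists: dbgen terminates each record
# with '|', so splitting on the delimiter yields one extra empty field.
line = "0|AFRICA|no comment|"      # a region record, dbgen style
fields = line.split("|")
print(fields)  # → ['0', 'AFRICA', 'no comment', '']
```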

Hive TPC-H Query 1 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Hives-TPCH-Query1.sql)

SELECT
  L_RETURNFLAG,
  L_LINESTATUS,
  SUM(L_QUANTITY),
  SUM(L_EXTENDEDPRICE),
  SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)),
  SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)),
  AVG(L_QUANTITY),
  AVG(L_EXTENDEDPRICE),
  AVG(L_DISCOUNT),
  COUNT(1)
FROM lineitem
WHERE L_SHIPDATE <= '1998-09-02'
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS;
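Worth keeping in mind when comparing results: the two Query 1 variants do not filter on the same cutoff. The Redshift version computes date '1998-12-01' minus a 108-day interval, while the Hive version hard-codes '1998-09-02'. A quick date-arithmetic check of what each predicate actually compares against:

```python
# Compare the effective l_shipdate cutoffs of the two Query 1 variants.
from datetime import date, timedelta

base = date(1998, 12, 1)
redshift_cutoff = base - timedelta(days=108)
hive_cutoff = date(1998, 9, 2)

print(redshift_cutoff)        # → 1998-08-15
print((base - hive_cutoff).days)  # → 90
```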

Hive TPC-H Query 5 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Hives-TPCH-Query5.sql)

select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer c join (
  select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey
  from orders o join (
    select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey
    from lineitem l join (
      select n_name, s_suppkey, s_nationkey
      from supplier s join (
        select n_name, n_nationkey
        from nation n join region r
          on n.n_regionkey = r.r_regionkey and r.r_name = 'MIDDLE EAST'
      ) n1 on s.s_nationkey = n1.n_nationkey
    ) s1 on l.l_suppkey = s1.s_suppkey
  ) l1 on l1.l_orderkey = o.o_orderkey
) o1 on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;
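The nested-join form is intended to compute the same answer as the flat Redshift formulation of Query 5. As a sanity check, this sketch runs both shapes on one-row-per-table toy data in sqlite3 (column lists trimmed to what the query touches; the data is fabricated):

```python
# Toy demonstration that the flat Query 5 and the nested-join rewrite
# agree: one qualifying customer/order/lineitem chain in 'MIDDLE EAST'.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region (r_regionkey INT, r_name TEXT);
CREATE TABLE nation (n_nationkey INT, n_name TEXT, n_regionkey INT);
CREATE TABLE supplier (s_suppkey INT, s_nationkey INT);
CREATE TABLE customer (c_custkey INT, c_nationkey INT);
CREATE TABLE orders (o_orderkey INT, o_custkey INT);
CREATE TABLE lineitem (l_orderkey INT, l_suppkey INT,
                       l_extendedprice REAL, l_discount REAL);
INSERT INTO region VALUES (0, 'MIDDLE EAST'), (1, 'ASIA');
INSERT INTO nation VALUES (0, 'EGYPT', 0), (1, 'JAPAN', 1);
INSERT INTO supplier VALUES (1, 0);
INSERT INTO customer VALUES (1, 0);
INSERT INTO orders VALUES (1, 1);
INSERT INTO lineitem VALUES (1, 1, 100.0, 0.25);
""")

FLAT = """
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey and n_regionkey = r_regionkey
  and r_name = 'MIDDLE EAST'
group by n_name order by revenue desc"""

NESTED = """
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer c join (
  select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey
  from orders o join (
    select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey
    from lineitem l join (
      select n_name, s_suppkey, s_nationkey
      from supplier s join (
        select n_name, n_nationkey
        from nation n join region r
          on n.n_regionkey = r.r_regionkey and r.r_name = 'MIDDLE EAST'
      ) n1 on s.s_nationkey = n1.n_nationkey
    ) s1 on l.l_suppkey = s1.s_suppkey
  ) l1 on l1.l_orderkey = o.o_orderkey
) o1 on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name order by revenue desc"""

flat = con.execute(FLAT).fetchall()
nested = con.execute(NESTED).fetchall()
print(flat, nested)  # both → [('EGYPT', 75.0)]
```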


(Non-cloud) alternative: MonetDB

In case one of the above cloud approaches does not work for you in due time (report why not!), or in case you are just curious, you can use MonetDB (http://www.monetdb.org/) as an alternative (on your own laptop/desktop/workstation!).

See the MonetDB website at https://www.monetdb.org/Downloads and https://www.monetdb.org/Documentation/Guide/Installation for instructions on how to download and install the latest release of MonetDB (Jul2015-SP4) on your system, https://www.monetdb.org/Documentation/UserGuide/Tutorial for a general tutorial on how to use it, and https://www.monetdb.org/Documentation/Cookbooks/SQLrecipies/Clients/SQLWorkbench for documentation on how to connect SQL Workbench (which we used with Redshift above) to MonetDB. (You do not necessarily need SQL Workbench or any other graphical client interface; the MonetDB-provided textual, console-based `mclient` is enough for this exercise.)

You will find TPC-H data as compressed CSV files for scale factors 1, 3 and 10 (i.e., sizes 1 GB, 3 GB, 10 GB) at http://homepages.cwi.nl/~manegold/TPC-H/ . Please download all files for the scale factor(s) you want to try into a directory on your machine (use a separate directory per scale factor), and note the entire absolute path to that directory. There is no need to unpack / decompress these files, as MonetDB can bulk-load data directly from compressed CSV files.

Then start the MonetDB server (mserver) as per the instructions on the MonetDB website (see above for links). At http://homepages.cwi.nl/~manegold/TPC-H/ you will also find the SQL scripts for MonetDB to create the database schema (tables), load the data (NOTE: in load_data.sql you need to replace "_MyDataPath_" with the entire absolute path to the directory you downloaded the data files to!), run TPC-H queries 1 & 5, create foreign keys, and drop the tables again. Run these scripts (or their content) via mclient, SQL Workbench, or your favorite SQL client.
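To illustrate the "no need to decompress" point: loading straight from a gzip-compressed, '|'-delimited file is just streaming decompression plus delimited parsing. A small local analogue with Python's stdlib and sqlite3 (the file name and two-row sample are invented for the demo; MonetDB's actual mechanism is its COPY INTO statement):

```python
# Stream a gzipped '|'-delimited file into a table without ever
# writing the decompressed data to disk.
import csv, gzip, os, sqlite3, tempfile

# Write a tiny gzipped sample in region.tbl layout (key|name|comment).
path = os.path.join(tempfile.mkdtemp(), "region.tbl.gz")
with gzip.open(path, "wt", newline="") as f:
    f.write("0|AFRICA|n/a\n1|AMERICA|n/a\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE region (r_regionkey INT, r_name TEXT, r_comment TEXT)")
with gzip.open(path, "rt", newline="") as f:
    rows = csv.reader(f, delimiter="|")
    con.executemany("INSERT INTO region VALUES (?, ?, ?)", rows)
n = con.execute("SELECT count(*) FROM region").fetchone()[0]
print(n)  # → 2
```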


Tasks:

Load the datasets and run the queries at least three times. Note the time it takes Hive/Redshift to complete Queries 1 and 5 separately. Are the results the same? If not, why not?

Run the queries on a 10-node cluster and the scale factor 300 data in prefix s3://tpch-bads-data/sf300/

Hint: Look at the loads/queries tab in the Redshift console to monitor progress.

Hint: A medium-size data set is available in s3://tpch-bads-data/sf10/ ; this can help in seeing performance differences on a two-node cluster.

Visualize the runtimes and try to explain why they are different (if they are). Include initial (and perhaps optimized) query plans in your report.

Bonus: Using the schema and the queries, design and explain a partitioning scheme for Hive and Redshift that optimizes the runtime of the two queries. Run the queries again and measure. Check the query plans to look for the impact of the partitioning changes.

Redshift partitioning guide: http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

Hive partitioning guide: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Hint: Look at individual joins, and the size of the tables involved.

Hint: For Hive, consider moving the data from S3 into HDFS using the CREATE TABLE AS SELECT … method. Multiple schemas can help here.

Hint: Foreign keys for TPC-H (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/ForeignKeys-for-TPCH.sql)

ALTER TABLE nation ADD CONSTRAINT nation_regionkey FOREIGN KEY (n_regionkey) REFERENCES region (r_regionkey);
ALTER TABLE supplier ADD CONSTRAINT supplier_nationkey FOREIGN KEY (s_nationkey) REFERENCES nation (n_nationkey);
ALTER TABLE customer ADD CONSTRAINT customer_nationkey FOREIGN KEY (c_nationkey) REFERENCES nation (n_nationkey);
ALTER TABLE partsupp ADD CONSTRAINT partsupp_partkey FOREIGN KEY (ps_partkey) REFERENCES part (p_partkey);
ALTER TABLE partsupp ADD CONSTRAINT partsupp_suppkey FOREIGN KEY (ps_suppkey) REFERENCES supplier (s_suppkey);
ALTER TABLE orders ADD CONSTRAINT order_custkey FOREIGN KEY (o_custkey) REFERENCES customer (c_custkey);
ALTER TABLE lineitem ADD CONSTRAINT lineitem_orderkey FOREIGN KEY (l_orderkey) REFERENCES orders (o_orderkey);


ALTER TABLE lineitem ADD CONSTRAINT lineitem_partkey FOREIGN KEY (l_partkey) REFERENCES part (p_partkey);
ALTER TABLE lineitem ADD CONSTRAINT lineitem_suppkey FOREIGN KEY (l_suppkey) REFERENCES supplier (s_suppkey);
ALTER TABLE lineitem ADD CONSTRAINT lineitem_partsuppkey FOREIGN KEY (l_partkey, l_suppkey) REFERENCES partsupp (ps_partkey, ps_suppkey);
COMMIT; -- don't forget
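As a reminder of what these constraints assert, the sketch below shows a foreign-key violation being rejected by sqlite3 once PRAGMA foreign_keys is enabled. (Redshift, by contrast, treats primary- and foreign-key constraints as informational hints for the query planner and does not enforce them, so declaring them mainly helps the optimizer.)

```python
# Demonstrate foreign-key semantics with sqlite3: a nation row must
# reference an existing region row.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE region (r_regionkey INT PRIMARY KEY, r_name TEXT);
CREATE TABLE nation (n_nationkey INT PRIMARY KEY, n_name TEXT,
    n_regionkey INT REFERENCES region (r_regionkey));
INSERT INTO region VALUES (0, 'AFRICA');
INSERT INTO nation VALUES (0, 'EGYPT', 0);   -- ok: region 0 exists
""")
try:
    con.execute("INSERT INTO nation VALUES (1, 'ATLANTIS', 9)")  # no region 9
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # → True
```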