
Cloud SQL Shootout: TPC-H on Redshift and Hive
(Source: homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Hivevs.Redshift.pdf)



Cloud SQL Shootout: TPC-H on Redshift and Hive

Today's business analyst demands SQL-like access to Big Data™. Your task today is to design a SQL-in-the-cloud data warehouse system. You will compare Hive and AWS Redshift, a hosted version of the parallel database system Actian Matrix (probably better known under its previous name, ParAccel). As a benchmark we are going to use TPC-H, an industry-standard benchmark for analytical SQL systems: http://www.tpc.org/tpch/ Three data sets with scale factors 1, 10 and 300 have already been created for you and uploaded to S3. TPC-H scale factor 300 means that the largest table, "lineitem", has 1,799,989,091 records.

Preparation for Redshift

Install SQL Workbench and the Redshift JDBC driver: http://www.sql-workbench.net/ Installation instructions: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html The Redshift JDBC driver can be downloaded from http://homepages.cwi.nl/~hannes/RedshiftJDBC41-1.1.7.1007.jar This time, you will need your AWS access credentials. Create a new access key: go to "Security Credentials" in the console.


Note down your access key ID and secret access key somewhere you will find them again. You can also simply download the key file, which contains this information.
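One way to keep the keys out of the SQL you paste around: keep them in environment variables and build the credentials string that the COPY commands later in this lab expect. This is only an illustrative convention (the variable names and the copy_credentials helper are made up here; Redshift does not read them by itself):

```python
# Sketch: assemble the Redshift COPY 'credentials' string from
# environment variables instead of pasting raw keys into every query.
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are conventional names,
# used here only as a place to stash the values from the key file.
import os

def copy_credentials() -> str:
    access_key = os.environ.get("AWS_ACCESS_KEY_ID", "XXXXX")
    secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY", "YYYYY")
    return f"aws_access_key_id={access_key};aws_secret_access_key={secret_key}"

print(copy_credentials())
```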


Redshift Startup

Go to the AWS Redshift console and click the "Launch Cluster" button.


For now, we will create a single-node cluster for testing purposes.


Make sure you note the username and password; you will need them later. The database name can remain empty.


Select a single-node cluster for now.


All additional configuration can remain at the defaults. Then launch the cluster.


Go to the clusters dashboard; you will see your cluster launching.


Wait until the cluster is available.

Click the cluster name ("bads-test1" here) to show its details. It might show a warning because the firewall does not allow access to the JDBC port. Click the warning sign and then "Edit security group".


Edit the security group on the "inbound" tab to allow Redshift connections from anywhere:

Afterwards, the warning should be gone.


If connecting with SQL Workbench (see below) does not work, either after the above changes or even though you never got the "No Inbound Permissions" warning in the first place, please try the following in the EC2 console to open up access. (Thanks to Taco Wijnsma for the hint!)


Then, use the displayed JDBC URL to connect using SQL Workbench.


Redshift Schema / Data Loading

Run the following script in SQL Workbench to create and load the tables (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/RedshiftSchemaDataLoading.sql).

CREATE TABLE region (
  r_regionkey INT NOT NULL, r_name VARCHAR(25) NOT NULL,
  r_comment VARCHAR(152) NOT NULL,
  PRIMARY KEY (r_regionkey));

CREATE TABLE nation (
  n_nationkey INT NOT NULL, n_name VARCHAR(25) NOT NULL,
  n_regionkey INT NOT NULL, n_comment VARCHAR(152) NOT NULL,
  PRIMARY KEY (n_nationkey));

CREATE TABLE supplier (
  s_suppkey INT NOT NULL, s_name VARCHAR(25) NOT NULL,
  s_address VARCHAR(40) NOT NULL, s_nationkey INT NOT NULL,
  s_phone VARCHAR(15) NOT NULL, s_acctbal DECIMAL(15,2) NOT NULL,
  s_comment VARCHAR(101) NOT NULL,
  PRIMARY KEY (s_suppkey));

CREATE TABLE customer (
  c_custkey INT NOT NULL, c_name VARCHAR(25) NOT NULL,
  c_address VARCHAR(40) NOT NULL, c_nationkey INT NOT NULL,
  c_phone VARCHAR(15) NOT NULL, c_acctbal DECIMAL(15,2) NOT NULL,
  c_mktsegment VARCHAR(10) NOT NULL, c_comment VARCHAR(117) NOT NULL,
  PRIMARY KEY (c_custkey));

CREATE TABLE part (
  p_partkey INT NOT NULL, p_name VARCHAR(55) NOT NULL,
  p_mfgr VARCHAR(25) NOT NULL, p_brand VARCHAR(10) NOT NULL,
  p_type VARCHAR(25) NOT NULL, p_size INT NOT NULL,
  p_container VARCHAR(10) NOT NULL, p_retailprice DECIMAL(15,2) NOT NULL,
  p_comment VARCHAR(23) NOT NULL,
  PRIMARY KEY (p_partkey));

CREATE TABLE partsupp (
  ps_partkey INT NOT NULL, ps_suppkey INT NOT NULL,
  ps_availqty INT NOT NULL, ps_supplycost DECIMAL(15,2) NOT NULL,
  ps_comment VARCHAR(199) NOT NULL,
  PRIMARY KEY (ps_partkey, ps_suppkey),
  FOREIGN KEY (ps_partkey) REFERENCES part (p_partkey),
  FOREIGN KEY (ps_suppkey) REFERENCES supplier (s_suppkey));

CREATE TABLE orders (
  o_orderkey INT NOT NULL, o_custkey INT NOT NULL,
  o_orderstatus VARCHAR(1) NOT NULL, o_totalprice DECIMAL(15,2) NOT NULL,
  o_orderdate DATE NOT NULL, o_orderpriority VARCHAR(15) NOT NULL,
  o_clerk VARCHAR(15) NOT NULL, o_shippriority INT NOT NULL,
  o_comment VARCHAR(79) NOT NULL,
  PRIMARY KEY (o_orderkey));

CREATE TABLE lineitem (
  l_orderkey INT NOT NULL, l_partkey INT NOT NULL,
  l_suppkey INT NOT NULL, l_linenumber INT NOT NULL,
  l_quantity INTEGER NOT NULL, l_extendedprice DECIMAL(15,2) NOT NULL,
  l_discount DECIMAL(15,2) NOT NULL, l_tax DECIMAL(15,2) NOT NULL,
  l_returnflag VARCHAR(1) NOT NULL, l_linestatus VARCHAR(1) NOT NULL,
  l_shipdate DATE NOT NULL, l_commitdate DATE NOT NULL,
  l_receiptdate DATE NOT NULL, l_shipinstruct VARCHAR(25) NOT NULL,
  l_shipmode VARCHAR(10) NOT NULL, l_comment VARCHAR(44) NOT NULL,
  PRIMARY KEY (l_orderkey, l_linenumber));

COMMIT;
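Before pasting DDL into a remote cluster, a quick local syntax check can catch typos. As an illustration only, SQLite happens to accept the same INT/VARCHAR/DECIMAL type names (mapping them to its type affinities), so the first two statements above can be exercised with Python's built-in sqlite3:

```python
# Minimal local sanity check of the region/nation DDL using sqlite3.
# SQLite is not Redshift; this only verifies the statements parse and
# accept a row, which is enough to catch copy/paste damage.
import sqlite3

DDL = """
CREATE TABLE region (r_regionkey INT NOT NULL, r_name VARCHAR(25) NOT NULL,
    r_comment VARCHAR(152) NOT NULL, PRIMARY KEY (r_regionkey));
CREATE TABLE nation (n_nationkey INT NOT NULL, n_name VARCHAR(25) NOT NULL,
    n_regionkey INT NOT NULL, n_comment VARCHAR(152) NOT NULL,
    PRIMARY KEY (n_nationkey));
"""

con = sqlite3.connect(":memory:")
con.executescript(DDL)
con.execute("INSERT INTO region VALUES (0, 'AFRICA', 'n/a')")
count = con.execute("SELECT count(*) FROM region").fetchone()[0]
print(count)  # → 1
```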


-- In the remainder, replace XXXXX / YYYYY with your access key / secret access key!

copy region from 's3://tpch-bads-data/sf1/region/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy nation from 's3://tpch-bads-data/sf1/nation/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy customer from 's3://tpch-bads-data/sf1/customer/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy orders from 's3://tpch-bads-data/sf1/orders/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy lineitem from 's3://tpch-bads-data/sf1/lineitem/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy part from 's3://tpch-bads-data/sf1/part/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy partsupp from 's3://tpch-bads-data/sf1/partsupp/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
copy supplier from 's3://tpch-bads-data/sf1/supplier/' delimiter '|' gzip credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=YYYYY';
COMMIT;

Replace XXXXX / YYYYY with your access key / secret access key! You can observe your cluster working by going to the "Queries" tab in the cluster details on the Redshift console.
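Since the eight COPY statements differ only in the table name, you could also generate them instead of editing the keys into each one by hand. A sketch (the copy_statements helper is hypothetical; the bucket prefix, delimiter and placeholder keys are taken from the statements above):

```python
# Generate the eight COPY statements for the SF1 data set, so the
# access keys are substituted in exactly one place.
TABLES = ["region", "nation", "customer", "orders",
          "lineitem", "part", "partsupp", "supplier"]

def copy_statements(bucket_prefix: str, access_key: str, secret_key: str):
    creds = f"aws_access_key_id={access_key};aws_secret_access_key={secret_key}"
    return [
        f"copy {t} from '{bucket_prefix}/{t}/' delimiter '|' gzip "
        f"credentials '{creds}';"
        for t in TABLES
    ]

for stmt in copy_statements("s3://tpch-bads-data/sf1", "XXXXX", "YYYYY"):
    print(stmt)
```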


Make sure the data is loaded by running a SELECT COUNT(*) for all the loaded tables after the COMMIT (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/SELECT_COUNT.sql):

SELECT count(*) from region;
SELECT count(*) from nation;
SELECT count(*) from supplier;
SELECT count(*) from customer;
SELECT count(*) from part;
SELECT count(*) from partsupp;
SELECT count(*) from orders;
SELECT count(*) from lineitem;

Run the following queries and note their runtime.
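For reference when checking the counts: the TPC-H specification fixes the base-table cardinalities at scale factor 1 (lineitem's exact count is what the dbgen data generator emits). A small hypothetical checker, with the expected numbers inlined:

```python
# Expected TPC-H SF1 base-table cardinalities, for comparing against
# the SELECT COUNT(*) results above.
EXPECTED_SF1 = {
    "region": 5, "nation": 25, "supplier": 10_000, "customer": 150_000,
    "part": 200_000, "partsupp": 800_000, "orders": 1_500_000,
    "lineitem": 6_001_215,
}

def check_counts(observed: dict) -> list:
    """Return the names of tables whose observed count is off."""
    return [t for t, n in EXPECTED_SF1.items() if observed.get(t) != n]

# An empty list means every table loaded completely.
print(check_counts(dict(EXPECTED_SF1)))  # → []
```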

Redshift TPC-H Query 1 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Redshift-TPCH-Query1.sql)

select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '108' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
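To see what Query 1's aggregates compute, here is a toy run on three fabricated lineitem rows using Python's built-in sqlite3 (SQLite has no INTERVAL syntax, so the cutoff '1998-08-15', i.e. 1998-12-01 minus 108 days, is precomputed; the select list is trimmed to a few of the aggregates):

```python
# Toy Query 1: group surviving lineitem rows by (returnflag, linestatus)
# and aggregate quantities and prices. The third row ships after the
# cutoff date and is filtered out.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE lineitem (
    l_quantity INT, l_extendedprice REAL, l_discount REAL, l_tax REAL,
    l_returnflag TEXT, l_linestatus TEXT, l_shipdate TEXT)""")
con.executemany("INSERT INTO lineitem VALUES (?,?,?,?,?,?,?)", [
    (10, 100.0, 0.25, 0.05, 'N', 'O', '1998-01-01'),
    (20, 200.0, 0.00, 0.00, 'N', 'O', '1998-02-01'),
    ( 5,  50.0, 0.50, 0.10, 'R', 'F', '1998-12-31'),  # after cutoff
])
rows = con.execute("""
    select l_returnflag, l_linestatus,
           sum(l_quantity), sum(l_extendedprice),
           sum(l_extendedprice * (1 - l_discount)),
           count(*)
    from lineitem
    where l_shipdate <= '1998-08-15'
    group by l_returnflag, l_linestatus
    order by l_returnflag, l_linestatus""").fetchall()
print(rows)  # → [('N', 'O', 30, 300.0, 275.0, 2)]
```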

Redshift TPC-H Query 5 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Redshift-TPCH-Query5.sql)

select
  n_name,
  sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'MIDDLE EAST'
group by n_name
order by revenue desc;


Redshift Shutting Down

Once you are done with your queries, shut down your cluster.


Hive Schema / Data Loading

Start up an EMR cluster (all defaults, 2 nodes; remember the Hue security group).

Access Hue's Hive query editor


Set S3 Credentials for Hive

Set the first key to fs.s3n.awsAccessKeyId and its value to your S3 access key; set the second key to fs.s3n.awsSecretAccessKey and its value to your S3 secret key. Create and load the tables (not really loading: these external tables just point at the data in S3) (a plain text file is available at


http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/HivesSchemaDataLoading.sql)

CREATE EXTERNAL TABLE customer(
  C_CustKey int, C_Name varchar(64), C_Address varchar(64),
  C_NationKey int, C_Phone varchar(64), C_AcctBal decimal(13, 2),
  C_MktSegment varchar(64), C_Comment varchar(120), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/customer/';

CREATE EXTERNAL TABLE lineitem(
  L_OrderKey int, L_PartKey int, L_SuppKey int, L_LineNumber int,
  L_Quantity int, L_ExtendedPrice decimal(13, 2), L_Discount decimal(13, 2),
  L_Tax decimal(13, 2), L_ReturnFlag varchar(64), L_LineStatus varchar(64),
  L_ShipDate date, L_CommitDate date, L_ReceiptDate date,
  L_ShipInstruct varchar(64), L_ShipMode varchar(64), L_Comment varchar(64),
  skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/lineitem/';

CREATE EXTERNAL TABLE nation(
  N_NationKey int, N_Name varchar(64), N_RegionKey int,
  N_Comment varchar(160), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/nation/';

CREATE EXTERNAL TABLE orders(
  O_OrderKey int, O_CustKey int,


  O_OrderStatus varchar(64), O_TotalPrice decimal(13, 2), O_OrderDate date,
  O_OrderPriority varchar(15), O_Clerk varchar(64), O_ShipPriority int,
  O_Comment varchar(80), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/orders/';

CREATE EXTERNAL TABLE part(
  P_PartKey int, P_Name varchar(64), P_Mfgr varchar(64), P_Brand varchar(64),
  P_Type varchar(64), P_Size int, P_Container varchar(64),
  P_RetailPrice decimal(13, 2), P_Comment varchar(64), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/part/';

CREATE EXTERNAL TABLE partsupp(
  PS_PartKey int, PS_SuppKey int, PS_AvailQty int,
  PS_SupplyCost decimal(13, 2), PS_Comment varchar(200), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/partsupp/';

CREATE EXTERNAL TABLE region(
  R_RegionKey int, R_Name varchar(64), R_Comment varchar(160),
  skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/region/';

CREATE EXTERNAL TABLE supplier(
  S_SuppKey int, S_Name varchar(64), S_Address varchar(64),
  S_NationKey int, S_Phone varchar(18), S_AcctBal decimal(13, 2),


  S_Comment varchar(105), skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3n://tpch-bads-data/sf1/supplier/';

Again, you might want to make sure the data is accessible by running a SELECT COUNT(*) for all the tables (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/SELECT_COUNT.sql):

SELECT count(*) from region;
SELECT count(*) from nation;
SELECT count(*) from supplier;
SELECT count(*) from customer;
SELECT count(*) from part;
SELECT count(*) from partsupp;
SELECT count(*) from orders;
SELECT count(*) from lineitem;

Please note whether, and if so how, the execution times for these SELECT COUNT(*) queries differ between Redshift and Hive. If they do, any idea why?
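The trailing "skip varchar(64)" column in every table above deals with dbgen's record format: each line of the generated .tbl files ends with a trailing '|', so a delimited reader produces one extra (empty) field per record. A two-line illustration:

```python
# Why the dummy 'skip' column exists: dbgen terminates each record
# with '|', so splitting on the delimiter yields one extra empty field.
line = "0|AFRICA|no comment|"      # a region record, dbgen style
fields = line.split("|")
print(fields)  # → ['0', 'AFRICA', 'no comment', '']
```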

Hive TPC-H Query 1 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Hives-TPCH-Query1.sql)

SELECT
  L_RETURNFLAG,
  L_LINESTATUS,
  SUM(L_QUANTITY),
  SUM(L_EXTENDEDPRICE),
  SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)),
  SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)),
  AVG(L_QUANTITY),
  AVG(L_EXTENDEDPRICE),
  AVG(L_DISCOUNT),
  COUNT(1)
FROM lineitem
WHERE L_SHIPDATE <= '1998-09-02'
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS;
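Worth keeping in mind when comparing results: the two Query 1 variants do not filter on the same cutoff. The Redshift version computes date '1998-12-01' minus a 108-day interval, while the Hive version hard-codes '1998-09-02'. A quick date-arithmetic check of what each predicate actually compares against:

```python
# Compare the effective l_shipdate cutoffs of the two Query 1 variants.
from datetime import date, timedelta

base = date(1998, 12, 1)
redshift_cutoff = base - timedelta(days=108)
hive_cutoff = date(1998, 9, 2)

print(redshift_cutoff)        # → 1998-08-15
print((base - hive_cutoff).days)  # → 90
```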

Hive TPC-H Query 5 (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/Hives-TPCH-Query5.sql)

select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer c join (
  select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey
  from orders o join (
    select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey
    from lineitem l join (
      select n_name, s_suppkey, s_nationkey
      from supplier s join (
        select n_name, n_nationkey
        from nation n join region r
          on n.n_regionkey = r.r_regionkey and r.r_name = 'MIDDLE EAST'
      ) n1 on s.s_nationkey = n1.n_nationkey
    ) s1 on l.l_suppkey = s1.s_suppkey
  ) l1 on l1.l_orderkey = o.o_orderkey
) o1 on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;
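The nested-join form is intended to compute the same answer as the flat Redshift formulation of Query 5. As a sanity check, this sketch runs both shapes on one-row-per-table toy data in sqlite3 (column lists trimmed to what the query touches; the data is fabricated):

```python
# Toy demonstration that the flat Query 5 and the nested-join rewrite
# agree: one qualifying customer/order/lineitem chain in 'MIDDLE EAST'.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region (r_regionkey INT, r_name TEXT);
CREATE TABLE nation (n_nationkey INT, n_name TEXT, n_regionkey INT);
CREATE TABLE supplier (s_suppkey INT, s_nationkey INT);
CREATE TABLE customer (c_custkey INT, c_nationkey INT);
CREATE TABLE orders (o_orderkey INT, o_custkey INT);
CREATE TABLE lineitem (l_orderkey INT, l_suppkey INT,
                       l_extendedprice REAL, l_discount REAL);
INSERT INTO region VALUES (0, 'MIDDLE EAST'), (1, 'ASIA');
INSERT INTO nation VALUES (0, 'EGYPT', 0), (1, 'JAPAN', 1);
INSERT INTO supplier VALUES (1, 0);
INSERT INTO customer VALUES (1, 0);
INSERT INTO orders VALUES (1, 1);
INSERT INTO lineitem VALUES (1, 1, 100.0, 0.25);
""")

FLAT = """
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey and n_regionkey = r_regionkey
  and r_name = 'MIDDLE EAST'
group by n_name order by revenue desc"""

NESTED = """
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer c join (
  select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey
  from orders o join (
    select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey
    from lineitem l join (
      select n_name, s_suppkey, s_nationkey
      from supplier s join (
        select n_name, n_nationkey
        from nation n join region r
          on n.n_regionkey = r.r_regionkey and r.r_name = 'MIDDLE EAST'
      ) n1 on s.s_nationkey = n1.n_nationkey
    ) s1 on l.l_suppkey = s1.s_suppkey
  ) l1 on l1.l_orderkey = o.o_orderkey
) o1 on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name order by revenue desc"""

flat = con.execute(FLAT).fetchall()
nested = con.execute(NESTED).fetchall()
print(flat, nested)  # both → [('EGYPT', 75.0)]
```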


(Non-cloud) alternative: MonetDB

In case one of the above cloud approaches does not work for you in due time (report why not!), or in case you are just curious, you can use MonetDB (http://www.monetdb.org/) as an alternative (on your own laptop/desktop/workstation!).

See the MonetDB website at https://www.monetdb.org/Downloads and https://www.monetdb.org/Documentation/Guide/Installation for instructions on how to download and install the latest release of MonetDB (Jul2015-SP4) on your system, https://www.monetdb.org/Documentation/UserGuide/Tutorial for a general tutorial on how to use it, and https://www.monetdb.org/Documentation/Cookbooks/SQLrecipies/Clients/SQLWorkbench for documentation on how to connect SQL Workbench (which we used with Redshift above) to MonetDB. (You do not necessarily need SQL Workbench or any other graphical client interface; the MonetDB-provided textual, console-based `mclient` is enough for this exercise.)

You will find TPC-H data as compressed CSV files for scale factors 1, 3 and 10 (i.e., sizes 1 GB, 3 GB, 10 GB) at http://homepages.cwi.nl/~manegold/TPC-H/ . Please download all files for the scale factor(s) you want to try into a directory on your machine (use a separate directory per scale factor), and note the entire absolute path to that directory. There is no need to unpack / decompress these files, as MonetDB can bulk-load data directly from compressed CSV files.

Then start the MonetDB server (mserver) as per the instructions on the MonetDB website (see above for links). At http://homepages.cwi.nl/~manegold/TPC-H/ you will also find the SQL scripts for MonetDB to create the database schema (tables), load the data (NOTE: in load_data.sql you need to replace "_MyDataPath_" with the entire absolute path to the directory you downloaded the data files to!), run TPC-H queries 1 & 5, create foreign keys, and drop the tables again. Run these scripts (or their content) via mclient, SQL Workbench, or your favorite SQL client.
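To illustrate the "no need to decompress" point: loading straight from a gzip-compressed, '|'-delimited file is just streaming decompression plus delimited parsing. A small local analogue with Python's stdlib and sqlite3 (the file name and two-row sample are invented for the demo; MonetDB's actual mechanism is its COPY INTO statement):

```python
# Stream a gzipped '|'-delimited file into a table without ever
# writing the decompressed data to disk.
import csv, gzip, os, sqlite3, tempfile

# Write a tiny gzipped sample in region.tbl layout (key|name|comment).
path = os.path.join(tempfile.mkdtemp(), "region.tbl.gz")
with gzip.open(path, "wt", newline="") as f:
    f.write("0|AFRICA|n/a\n1|AMERICA|n/a\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE region (r_regionkey INT, r_name TEXT, r_comment TEXT)")
with gzip.open(path, "rt", newline="") as f:
    rows = csv.reader(f, delimiter="|")
    con.executemany("INSERT INTO region VALUES (?, ?, ?)", rows)
n = con.execute("SELECT count(*) FROM region").fetchone()[0]
print(n)  # → 2
```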


Tasks:

Load the datasets and run the queries at least three times. Note the time it takes Hive/Redshift to complete Queries 1 and 5 separately. Are the results the same? If not, why not?

Run the queries on a 10-node cluster and the scale factor 300 data in prefix s3://tpch-bads-data/sf300/

Hint: Look at the loads/queries tab in the Redshift console to monitor progress.

Hint: A medium-size data set is available in s3://tpch-bads-data/sf10/ ; this can help in seeing performance differences on a two-node cluster.

Visualize the runtimes and try to explain why they are different (if they are). Include initial (and perhaps optimized) query plans in your report.

Bonus: Using the schema and the queries, design and explain a partitioning scheme for Hive and Redshift that optimizes the runtime of the two queries. Run the queries again and measure. Check the query plans to look for the impact of the partitioning changes.

Redshift partitioning guide: http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

Hive partitioning guide: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Hint: Look at individual joins, and the size of the tables involved.

Hint: For Hive, consider moving the data from S3 into HDFS using the CREATE TABLE AS SELECT … method. Multiple schemas can help here.

Hint: Foreign keys for TPC-H (a plain text file is available at http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/ForeignKeys-for-TPCH.sql)

ALTER TABLE nation ADD CONSTRAINT nation_regionkey FOREIGN KEY (n_regionkey) REFERENCES region (r_regionkey);
ALTER TABLE supplier ADD CONSTRAINT supplier_nationkey FOREIGN KEY (s_nationkey) REFERENCES nation (n_nationkey);
ALTER TABLE customer ADD CONSTRAINT customer_nationkey FOREIGN KEY (c_nationkey) REFERENCES nation (n_nationkey);
ALTER TABLE partsupp ADD CONSTRAINT partsupp_partkey FOREIGN KEY (ps_partkey) REFERENCES part (p_partkey);
ALTER TABLE partsupp ADD CONSTRAINT partsupp_suppkey FOREIGN KEY (ps_suppkey) REFERENCES supplier (s_suppkey);
ALTER TABLE orders ADD CONSTRAINT order_custkey FOREIGN KEY (o_custkey) REFERENCES customer (c_custkey);
ALTER TABLE lineitem ADD CONSTRAINT lineitem_orderkey FOREIGN KEY (l_orderkey) REFERENCES orders (o_orderkey);


ALTER TABLE lineitem ADD CONSTRAINT lineitem_partkey FOREIGN KEY (l_partkey) REFERENCES part (p_partkey);
ALTER TABLE lineitem ADD CONSTRAINT lineitem_suppkey FOREIGN KEY (l_suppkey) REFERENCES supplier (s_suppkey);
ALTER TABLE lineitem ADD CONSTRAINT lineitem_partsuppkey FOREIGN KEY (l_partkey, l_suppkey) REFERENCES partsupp (ps_partkey, ps_suppkey);
COMMIT; -- don't forget
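As a reminder of what these constraints assert, the sketch below shows a foreign-key violation being rejected by sqlite3 once PRAGMA foreign_keys is enabled. (Redshift, by contrast, treats primary- and foreign-key constraints as informational hints for the query planner and does not enforce them, so declaring them mainly helps the optimizer.)

```python
# Demonstrate foreign-key semantics with sqlite3: a nation row must
# reference an existing region row.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE region (r_regionkey INT PRIMARY KEY, r_name TEXT);
CREATE TABLE nation (n_nationkey INT PRIMARY KEY, n_name TEXT,
    n_regionkey INT REFERENCES region (r_regionkey));
INSERT INTO region VALUES (0, 'AFRICA');
INSERT INTO nation VALUES (0, 'EGYPT', 0);   -- ok: region 0 exists
""")
try:
    con.execute("INSERT INTO nation VALUES (1, 'ATLANTIS', 9)")  # no region 9
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # → True
```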