
Working with Delimited Data in Apache Drill 1.6.0


Page 1: Working with Delimited Data in Apache Drill 1.6.0


Drill Workshop: Working with Delimited Data in Drill 1.6.0 (May 2016)

Page 2: Working with Delimited Data in Apache Drill 1.6.0


APACHE DRILL

• Schema-free, scale-out query engine for Hadoop and NoSQL

• Low latency

• Extreme ease of use

• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

Page 3: Working with Delimited Data in Apache Drill 1.6.0


Running Drill takes 10 minutes (or less), step by step, in SQL format

DOWNLOAD
https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes

EXTRACT
$ tar xf apache-drill-1.6.0.tar.gz
$ cd apache-drill-1.6.0

RUN
$ bin/drill-embedded

QUERY
> SELECT full_name, position_title, salary FROM cp.`employee.json` LIMIT 1;
+---------------+-----------------+----------+
| full_name     | position_title  | salary   |
+---------------+-----------------+----------+
| Sheri Nowmer  | President       | 80000.0  |
+---------------+-----------------+----------+
1 row selected (0.417 seconds)

Page 4: Working with Delimited Data in Apache Drill 1.6.0


Drill’s Data Model is Flexible

[Diagram: data formats arranged by flexibility, from fixed schema and flat (CSV, TSV) through Parquet and Avro to dynamic schema and complex (JSON, BSON, HBase).]

RDBMS/SQL-on-Hadoop table (fixed schema, flat):

  Name      Gender  Age
  Michael   M       6
  Jennifer  F       3

Apache Drill table (dynamic schema, complex):

  { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
  { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Page 5: Working with Delimited Data in Apache Drill 1.6.0


Drill Enables ‘SQL on Everything’

SELECT * FROM dfs.yelp.`business.json`

Storage plugin instance (dfs):
– DFS (Text, Parquet, JSON)
– HBase/MapR-DB
– Hive Metastore/HCatalog
– Easy API to go beyond Hadoop

Workspace (yelp):
– Sub-directory
– HBase namespace
– Hive database

Table (business.json):
– Pathnames
– Hive table
– HBase table

Page 6: Working with Delimited Data in Apache Drill 1.6.0


Schema Free?

• But you can use schema if you want. There are several benefits to applying schema:
  – Better comparisons
  – Better compression when using Parquet
  – More efficient in-memory representation

• Drill can also use schema provided by the Hive metastore, and use 100+ Hive SerDes to read the data.

Page 7: Working with Delimited Data in Apache Drill 1.6.0


Introduce external data sources to Drill


Currently supported providers:

  Storage Plugin Provider  | Workspace  | Table
  files                    | Path       | Path relative to workspace
  mongo                    | Database   | Collection
  hive                     | Database   | Table
  hbase                    | Namespace  | Table

Future or experimental providers:

  Storage Plugin Provider  | Workspace  | Table
  cassandra                | Keyspace   | Table
  jdbc                     | Database   | Table

SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;

SELECT * FROM dfs.yelp.`review.json` LIMIT 1;

USE dfs.yelp;
SELECT * FROM `review.json` LIMIT 1;

SELECT * FROM hbase.users LIMIT 1;

Page 8: Working with Delimited Data in Apache Drill 1.6.0


Let's Do Some Stuff

Page 9: Working with Delimited Data in Apache Drill 1.6.0


The Dataset
• NYC MTA Bus Time data
  – Covers three months of 2014
  – One day per file
• 92 TSV files, over 400 million rows
• We will work with one file, and with the entire data set time permitting
  – 5.4 million rows
  – 838 MB
• Your cluster already has the data on it.

Page 10: Working with Delimited Data in Apache Drill 1.6.0


Your Cluster
• Three Drillbits
• URL: http://<public_ip_here>:8047
• SSH: ec2-user@<public_ip_here>
  – Make sure you use the private key provided:
    ssh -i drill_workshop_private_key.openssh ec2-user@<public_ip>
  – A PuTTY key is also available

Page 11: Working with Delimited Data in Apache Drill 1.6.0


Why Drill for delimited data?
• The data is too large to manipulate in Excel
• Maybe you want to join the data with another source? (a hedged join sketch follows below)
• Maybe you want a code-free ELT workflow?
• Leave the original data untouched
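To illustrate the join point, here is a minimal sketch joining the bus positions against a route-name lookup file. The lookup file routes.csvh and its columns route_id and route_long_name are hypothetical, not part of the workshop dataset; bustime_vw is the typed view created later in the workshop.

select b.inferred_route_id, r.route_long_name, count(*) as pings
from bustime_vw b
join dfs.tmp.`routes.csvh` r          -- hypothetical CSV-with-header lookup file
  on b.inferred_route_id = r.route_id
group by b.inferred_route_id, r.route_long_name
limit 5;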

Page 12: Working with Delimited Data in Apache Drill 1.6.0


Our Goal
• Walk through exploration of a single data set
• Prepare the data for low-latency query
• Sub-tasks:
  – Initial exploration
  – Remove rows with invalid data
  – Apply types to the data
  – Transform to a more storage/performance-efficient format
  – Query it!

Page 13: Working with Delimited Data in Apache Drill 1.6.0


Step 0: Get Drill, get the Data

Get Drill 1.6.0 from the closest Apache Mirror:

curl -LO 'http://www.apache.org/dyn/closer.lua?filename=drill/drill-1.6.0/apache-drill-1.6.0.tar.gz&action=download'

Get the dataset from the MTA's S3 bucket:

curl -LO http://s3.amazonaws.com/MTABusTime/AppQuest3/MTA-Bus-Time_.2014-10-31.txt.xz

Page 14: Working with Delimited Data in Apache Drill 1.6.0


Step 1: Unpack Drill and Data

• Unpack Drill (suggested: /opt)
• Unpack the data (suggested: /tmp)

Page 15: Working with Delimited Data in Apache Drill 1.6.0


Launch sqlline

• Distributed:
  – sqlline -u jdbc:drill:schema=dfs.mta

• Embedded:
  – $DRILL_HOME/bin/drill-embedded

• use dfs;

Page 16: Working with Delimited Data in Apache Drill 1.6.0


Explore the Dataset

Usually, first you put the files somewhere. Check that Drill can see them.

show files in dfs.tmp;
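If the files do not show up where you expect, it can also help to list the schemas (storage plugin configurations and workspaces) Drill currently knows about; a quick hedged check:

show schemas;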

Page 17: Working with Delimited Data in Apache Drill 1.6.0


Explore the Dataset

See if Drill can query the .txt file:

use dfs.tmp;

select * from `MTA-Bus-Time_.2014-10-31.txt` limit 10;

Page 18: Working with Delimited Data in Apache Drill 1.6.0


Explore the Dataset

What just happened there?
• Drill has a dfs storage plugin.
• Drill comes with a workspace called tmp that points to /tmp (maprfs:///tmp on a MapR cluster, /tmp on your laptop).
• The file has a .txt extension, which Drill does not understand, so the query fails.
• Your cluster is pre-configured with a new inputFormat called tsvh: tab-separated values, with header.
• Let's rename the file, adding the extension .tsvh, then try again.

Page 19: Working with Delimited Data in Apache Drill 1.6.0


Explore the Dataset

See if Drill can query the file with the new .tsvh extension:

use dfs.tmp;

select * from `MTA-Bus-Time_.2014-10-31.txt.tsvh` limit 10;
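With the header extracted, you can also refer to individual columns by name; a minimal sketch, assuming the header names match the column names used later in the view:

select time_received, vehicle_id, latitude, longitude
from `MTA-Bus-Time_.2014-10-31.txt.tsvh`
limit 5;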

Page 20: Working with Delimited Data in Apache Drill 1.6.0


Explore the Dataset

What just happened there?
• Drill has a dfs storage plugin.
• Drill comes with a workspace called tmp that points to /tmp (maprfs:///tmp on a MapR cluster, /tmp on your laptop).
• Your cluster is pre-configured with a new inputFormat called tsvh: tab-separated values, with header.
• Drill used the storage plugin configuration to know how to read the data, and the inputFormat to figure out how to parse the text based on the extension.
• Drill used the first line of each file to generate named columns.
• Visit your Drillbit's UI, navigate to Storage, then click 'Update' next to dfs to view the plugin configuration.
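For contrast, if the file had no usable header (for example a plain .tsv read with the default, header-less text format), Drill would expose each record as a columns array and you would select fields by position. A hedged sketch; the file name and positions below are assumptions:

select columns[0] as latitude, columns[1] as longitude, columns[3] as vehicle_id
from dfs.tmp.`some-headerless-file.tsv`
limit 5;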

Page 21: Working with Delimited Data in Apache Drill 1.6.0


What we know so far

• Place your files somewhere Drill can find them.
  – dfs.tmp is not a bad place to start (it's always there)
• Name your files so Drill can read them.
  – The default dfs plugin inputFormats provide a list of known extensions
• Create your own inputFormats, and include options if you like.
  – extractHeader, skipFirstLine, fieldDelimiter
• This gives rise to custom formats (see the sketch after this list), such as:
  – CSV without headers (no options)
  – CSV with unusable headers (skipFirstLine)
  – Pipe-separated with headers (extractHeader, fieldDelimiter)
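Format options can also be supplied per query instead of in the plugin configuration, using Drill's table() function. A hedged sketch for a pipe-separated file with a header; the file name example.psv is hypothetical, and the option names assume the text format's fieldDelimiter/extractHeader options as exposed through table functions in recent Drill 1.x releases:

select *
from table(dfs.tmp.`example.psv`(type => 'text', fieldDelimiter => '|', extractHeader => true))
limit 10;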

Page 22: Working with Delimited Data in Apache Drill 1.6.0


Multiple Files

• Many datasets have more than one file.

• To group files together, it's convenient to place them in a workspace.

• Navigate to your Drillbit UI; let's look at the pre-created workspace.

Page 23: Working with Delimited Data in Apache Drill 1.6.0


The MTA Workspace

"workspaces": { ... "mta": { "location": "/user/ec2-user/data/nyc/mta", "writable": true, "defaultInputFormat": "tsvh" } },

Page 24: Working with Delimited Data in Apache Drill 1.6.0


The TSVH format

"format": { ... "tsvh": { "type": "text", "extensions": [ "tsvh" ], "extractHeader": true, "delimiter": "\t" } },

Page 25: Working with Delimited Data in Apache Drill 1.6.0


Checkpoint

• So far we have:
  – Queried a delimited text file via the dfs storage plugin
  – Used a workspace with some options to make querying easier

Page 26: Working with Delimited Data in Apache Drill 1.6.0


Schema

• SQL has types, but so far, our data does not.

use dfs.mta;

describe bustime;

Page 27: Working with Delimited Data in Apache Drill 1.6.0


Who needs schema?

Without schema:
> select min(vehicle_id) from bustime;
+---------+
| EXPR$0  |
+---------+
| 1001    |
+---------+
1 row selected (3.238 seconds)

With schema:
> select min(cast(vehicle_id as INT)) from bustime;
+---------+
| EXPR$0  |
+---------+
| 101     |
+---------+
1 row selected (3.248 seconds)
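The difference comes from string vs. integer comparison: without the cast, vehicle_id is compared lexicographically, so '1001' sorts before '101'. A hedged side-by-side check against the same table:

select min(vehicle_id)              as string_min,
       min(cast(vehicle_id as int)) as int_min
from bustime;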

Page 28: Working with Delimited Data in Apache Drill 1.6.0


Adding Schema

select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) as time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
       then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
       then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime
limit 5;

Page 29: Working with Delimited Data in Apache Drill 1.6.0


Adding Schema
• But how do you query that? How do you edit the query?
• A subselect? It works, but this is terrible for usability:

select min(vehicle_id) from (
  select
    cast(latitude as float) latitude,
    cast(longitude as float) longitude,
    cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
    cast(vehicle_id as int) as vehicle_id,
    cast(distance_along_trip as float) distance_along_trip,
    case when inferred_direction_id not in ('NULL', '')
         then cast(inferred_direction_id as int) end as inferred_direction_id,
    cast(inferred_phase as VARCHAR(20)) inferred_phase,
    cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
    cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
    case when next_scheduled_stop_distance not in ('NULL', '')
         then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
    cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
  from dfs.mta.bustime
  limit 5
);

Page 30: Working with Delimited Data in Apache Drill 1.6.0


Adding Schema
• Create a view:

create or replace view bustime_vw as
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
       then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
       then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime;

Page 31: Working with Delimited Data in Apache Drill 1.6.0


Now we have types…

0: jdbc:drill:schema=dfs.mta> describe bustime_vw;
+-------------------------------+--------------------+--------------+
| COLUMN_NAME                   | DATA_TYPE          | IS_NULLABLE  |
+-------------------------------+--------------------+--------------+
| latitude                      | FLOAT              | YES          |
| longitude                     | FLOAT              | YES          |
| time_received                 | TIMESTAMP          | YES          |
| vehicle_id                    | INTEGER            | YES          |
| distance_along_trip           | FLOAT              | YES          |
| inferred_direction_id         | INTEGER            | YES          |
| inferred_phase                | CHARACTER VARYING  | YES          |
| inferred_route_id             | CHARACTER VARYING  | YES          |
| inferred_trip_id              | CHARACTER VARYING  | YES          |
| next_scheduled_stop_distance  | FLOAT              | YES          |
| next_scheduled_stop_id        | CHARACTER VARYING  | YES          |
+-------------------------------+--------------------+--------------+
11 rows selected (0.131 seconds)

Page 32: Working with Delimited Data in Apache Drill 1.6.0


…and simplified queries

> select min(vehicle_id) from bustime_vw;
+---------+
| EXPR$0  |
+---------+
| 101     |
+---------+
1 row selected (4.235 seconds)
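The typed columns also make time-based analysis easy; a hedged sketch counting position reports per hour of the day, assuming the standard EXTRACT function over the view's timestamp column:

select extract(hour from time_received) as hr, count(*) as position_reports
from bustime_vw
group by extract(hour from time_received)
order by hr;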

Page 33: Working with Delimited Data in Apache Drill 1.6.0


But, sometimes we get bad data…

The NYC MTA does not, in fact, operate buses off the coast of Africa.

Page 34: Working with Delimited Data in Apache Drill 1.6.0


So update the view

Filter out the bad latitudes:

create or replace view bustime_vw as
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
       then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
       then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime
where latitude > 0.0;
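A quick hedged sanity check after replacing the view, to confirm the remaining coordinates fall in a plausible bounding box for New York City:

select min(latitude)  as min_lat, max(latitude)  as max_lat,
       min(longitude) as min_lon, max(longitude) as max_lon
from bustime_vw;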

Page 35: Working with Delimited Data in Apache Drill 1.6.0


After replacing the view

Better.

Page 36: Working with Delimited Data in Apache Drill 1.6.0


A Simple Aggregation - View

0: jdbc:drill:zk=local> select inferred_route_id, count(vehicle_id) layovers
                        from bustime_vw
                        where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
                        group by inferred_route_id
                        order by layovers desc limit 5;
+--------------------+-----------+
| inferred_route_id  | layovers  |
+--------------------+-----------+
| MTA NYCT_B6        | 7778      |
| MTA NYCT_B46       | 5751      |
| MTA NYCT_Q58       | 5276      |
| MTA NYCT_B41       | 4912      |
| MTA NYCT_Q46       | 4731      |
+--------------------+-----------+
5 rows selected (7.04 seconds)

SLOW!

Page 37: Working with Delimited Data in Apache Drill 1.6.0


A Better Format - CTAS to Parquet*

0: jdbc:drill:zk=local> create table bustime_pq as select * from bustime_vw;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 5411159                    |
+-----------+----------------------------+
1 row selected (48.69 seconds)

*Parquet is the default format for CTAS.
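A quick hedged check that the Parquet copy contains everything the view produced; the count should match the 'Number of records written' above:

select count(*) from bustime_pq;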

Page 38: Working with Delimited Data in Apache Drill 1.6.0


A Simple Aggregation - Parquet

0: jdbc:drill:zk=local> select inferred_route_id, count(vehicle_id) layovers
                        from bustime_pq
                        where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
                        group by inferred_route_id
                        order by layovers desc limit 5;
+--------------------+-----------+
| inferred_route_id  | layovers  |
+--------------------+-----------+
| MTA NYCT_B6        | 7778      |
| MTA NYCT_B46       | 5751      |
| MTA NYCT_Q58       | 5276      |
| MTA NYCT_B41       | 4912      |
| MTA NYCT_Q46       | 4731      |
+--------------------+-----------+
5 rows selected (1.959 seconds)

Better!
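To see where the speedup comes from, compare the query plans over the view and over the Parquet table; a hedged sketch using Drill's EXPLAIN:

explain plan for
select inferred_route_id, count(vehicle_id) layovers
from bustime_pq
where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
group by inferred_route_id
order by layovers desc
limit 5;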

Page 39: Working with Delimited Data in Apache Drill 1.6.0


Checkpoint
• So far we have:
  – Queried a delimited text file via the dfs storage plugin
  – Used a workspace with some options to make querying easier
  – Added types to columns
  – Created a view, filtering bad data
  – Transformed the data to Parquet for faster query
    • Parquet: columnar, compressed

Page 40: Working with Delimited Data in Apache Drill 1.6.0


Summary - Workflow for Delimited Text
• Drop a sample file in /tmp (local), maprfs:///tmp, or hdfs:///tmp
• Run exploratory queries (SELECT *) via the dfs.tmp workspace
• Create a workspace for your data pointing to a directory with the entire data set
• Understand the format and set options on the inputFormat
  – fieldDelimiter (multi-byte delimiters not yet supported)
  – extractHeader, skipFirstLine
• Add types to columns with CAST
• Create a view over the typed columns
• Transform the data to Parquet for faster query (see the recap sketch below)
  – Parquet: columnar, compressed
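As a condensed recap of this workflow, the core statements from the workshop in one place; a sketch only, with the column list abbreviated, so adjust schema, file, and column names to your own data:

use dfs.mta;

create or replace view bustime_vw as
select cast(vehicle_id as int)   as vehicle_id,
       cast(latitude   as float) as latitude,
       cast(longitude  as float) as longitude     -- ...plus the remaining typed columns
from dfs.mta.bustime
where latitude > 0.0;

create table bustime_pq as select * from bustime_vw;

select min(vehicle_id) from bustime_pq;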

Page 41: Working with Delimited Data in Apache Drill 1.6.0


Q & A

Engage with us!
@mapr | maprtech | [email protected] | MapR | mapr-technologies