Working with Delimited Data in Apache Drill 1.6.0

© 2016 MapR Technologies

Drill Workshop
Working with Delimited Data in Drill 1.6.0
May 2016


APACHE DRILL

• Schema-free, scale-out query engine for Hadoop and NoSQL
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs


Running Drill takes 10 minutes (or less)

DOWNLOAD (& step by step)
https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes

EXTRACT
$ tar xf apache-drill-1.6.0.tar.gz
$ cd apache-drill-1.6.0

RUN
$ bin/drill-embedded

QUERY (in SQL format)
> SELECT full_name, position_title, salary FROM cp.`employee.json` LIMIT 1;
+---------------+-----------------+----------+
| full_name     | position_title  | salary   |
+---------------+-----------------+----------+
| Sheri Nowmer  | President       | 80000.0  |
+---------------+-----------------+----------+
1 row selected (0.417 seconds)


Drill’s Data Model is Flexible

[Figure: data formats placed on two axes of increasing flexibility, fixed schema to dynamic schema and flat to complex. Formats shown: JSON, BSON; HBase; Parquet, Avro; CSV, TSV.]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }


Drill Enables ‘SQL on Everything’

SELECT * FROM dfs.yelp.`business.json`

Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop

Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database

Table (`business.json`)
- Pathnames
- Hive table
- HBase table


Schema Free?

• But you can use schema if you want. There are several benefits to applying schema:
– Better comparisons
– Better compression when using Parquet
– More efficient in-memory representation
• Drill can also use schema provided by the Hive metastore
• And use 100+ Hive SerDes to read the data.

Vince Gonzalez

Gordon Gekko, Wall Street


Introduce external data sources to Drill


Storage Plugin Provider   Workspace   Table

Currently supported providers
files                     Path        Path relative to workspace
mongo                     Database    Collection
hive                      Database    Table
hbase                     Namespace   Table

Future or experimental providers
cassandra                 Keyspace    Table
jdbc                      Database    Table

SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;

SELECT * FROM dfs.yelp.`review.json` LIMIT 1;

USE dfs.yelp;
SELECT * FROM `review.json` LIMIT 1;

SELECT * FROM hbase.users LIMIT 1;
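The way a dotted name breaks into storage plugin, workspace, and table can be pictured with a small Python sketch. This only illustrates the naming convention and is not how Drill itself parses SQL; the function name split_table_reference is hypothetical.

```python
import re

def split_table_reference(ref):
    """Split a dotted reference into parts, keeping backtick-quoted names whole."""
    parts = []
    # Either a `quoted name` (which may contain dots) or a run of plain characters.
    for match in re.finditer(r"`([^`]*)`|([^.`]+)", ref):
        quoted, plain = match.group(1), match.group(2)
        parts.append(quoted if quoted is not None else plain)
    return parts

# Storage plugin, workspace, table
print(split_table_reference("dfs.yelp.`review.json`"))
# Storage plugin, table (no workspace given)
print(split_table_reference("hbase.users"))
```

The backticks matter because the file name itself contains a dot; without them the extension would be read as another level of the name.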


Let's Do Some Stuff


The Dataset
• NYC MTA Bus Times
– Covers three months of 2014
– One day per file
• 92 TSV files, over 400 million rows
• We will work with one file, and with the entire data set if time permits
– 5.4 million rows
– 838MB
• Your cluster already has the data on it.


Your Cluster
• Three Drillbits
• URL: http://<public_ip_here>:8047
• SSH: ec2-user@<public_ip_here>
• Make sure you use the private key provided.
• ssh -i drill_workshop_private_key.openssh ec2-user@<public_ip>
• A PuTTY key is also available
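Port 8047 also serves Drill's REST API, so a query can be submitted over HTTP as well as through the web UI. The Python sketch below only builds the request for the /query.json endpoint and does not send it. The host name drillbit.example.com and the function name build_query_request are placeholders, not part of the workshop setup.

```python
import json
import urllib.request

def build_query_request(host, sql):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:8047/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# drillbit.example.com is a placeholder; substitute your cluster's public IP.
request = build_query_request("drillbit.example.com", "SHOW FILES IN dfs.tmp")
print(request.full_url)

# To run the query against a live Drillbit:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```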


Why Drill for delimited data?
• Too large to manipulate in Excel
• Maybe you want to join the data with another source?
• Maybe you want a code-free ELT workflow?
• Leave the original data untouched


Our Goal
• Walk through exploration of a single data set
• Prepare the data for low-latency query
• Sub-tasks
– Initial Exploration
– Remove rows with invalid data
– Apply types to the data
– Transform to a more storage/performance efficient format
– Query it!


Step 0: Get Drill, get the Data

Get Drill 1.6.0 from the closest Apache Mirror:

curl -LO 'http://www.apache.org/dyn/closer.lua?filename=drill/drill-1.6.0/apache-drill-1.6.0.tar.gz&action=download'

Get the dataset from the MTA's S3 bucket:

curl -LO http://s3.amazonaws.com/MTABusTime/AppQuest3/MTA-Bus-Time_.2014-10-31.txt.xz


Step 1: Unpack Drill and Data

• Unpack Drill (suggested: /opt)
• Unpack the Data (suggested: /tmp)


Launch sqlline
• Distributed:
– sqlline -u jdbc:drill:schema=dfs.mta
• Embedded:
– $DRILL_HOME/bin/drill-embedded
• use dfs;


Explore the Dataset

Usually, first you put the files somewhere. Check that Drill can see them.

show files in dfs.tmp;


Explore the Dataset

See if Drill can query the .txt file:

use dfs.tmp;

select * from `MTA-Bus-Time_.2014-10-31.txt` limit 10;


Explore the Dataset

What just happened there?

• Drill has a dfs storage plugin
• Drill comes with a workspace called tmp that points to /tmp (maprfs:///tmp on a MapR cluster, /tmp on your laptop)
• The file has a .txt extension, which Drill does not recognize, so the query fails.
• Your cluster is pre-configured with a new inputFormat, called tsvh - tab separated values, with header.
• Let's rename the file, adding the extension .tsvh, then try again.


Explore the Dataset

See if Drill can query the file with the new .tsvh extension:

use dfs.tmp;

select * from `MTA-Bus-Time_.2014-10-31.txt.tsvh` limit 10;


Explore the Dataset

What just happened there?

• Drill has a dfs storage plugin
• Drill comes with a workspace called tmp that points to /tmp (maprfs:///tmp on a MapR cluster, /tmp on your laptop)
• Your cluster is pre-configured with a new inputFormat, called tsvh - tab separated values, with header
• Drill used the storage plugin configuration to know how to read the data, and the inputFormat to figure out how to parse the text based on the extension
• Drill used the first line of each file to generate named columns
• Visit your Drillbit's UI, navigate to storage, then click 'Update' next to dfs to view the plugin configuration.


What we know so far

• Place your files somewhere Drill can find them.
– dfs.tmp is not a bad place to start (it's always there)
• Name your files so Drill can read them
– Default dfs plugin inputFormats provide a list of known extensions
• Create your own inputFormats, and include options if you like
– extractHeader, skipFirstLine, fieldDelimiter
• Gives rise to custom formats, such as:
– CSV without headers (no options)
– CSV with unusable headers (skipFirstLine)
– Pipe-separated with headers (extractHeader, fieldDelimiter)
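As a rough picture of what header extraction with a tab delimiter does, here is a minimal Python sketch using the standard csv module. It is an analogy only, not Drill's text reader. The sample rows are made up for illustration (the column names are borrowed from the bus-time data), and read_tsvh is a hypothetical name.

```python
import csv
import io

# Hypothetical two-row sample; the values are invented for illustration.
sample = "vehicle_id\tlatitude\tlongitude\n101\t40.7\t-73.9\n1001\t40.8\t-73.8\n"

def read_tsvh(text):
    """Parse tab-separated text, using the first line as column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = read_tsvh(sample)
# Every value is still a string; nothing has been typed yet.
print(rows[0]["vehicle_id"], type(rows[0]["vehicle_id"]).__name__)
```

Note that the header gives each column a name but not a type, which is why the later slides add CAST.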


Multiple Files

• Many datasets have more than one file.

• To group files together, it's convenient to place them in a workspace.

• Navigate to your Drillbit UI and look at the pre-created workspace.


The MTA Workspace

"workspaces": {
  ...
  "mta": {
    "location": "/user/ec2-user/data/nyc/mta",
    "writable": true,
    "defaultInputFormat": "tsvh"
  }
},


The TSVH format

"format": {
  ...
  "tsvh": {
    "type": "text",
    "extensions": [ "tsvh" ],
    "extractHeader": true,
    "delimiter": "\t"
  }
},


Checkpoint

• So far we have:
– Queried a delimited text file via the dfs storage plugin
– Used a workspace with some options to make query easier


Schema

• SQL has types, but so far, our data does not.

use dfs.mta;

describe bustime;


Who needs schema?

Without Schema
> select min(vehicle_id) from bustime;
+---------+
| EXPR$0  |
+---------+
| 1001    |
+---------+
1 row selected (3.238 seconds)

With Schema
> select min(cast(vehicle_id as INT)) from bustime;
+---------+
| EXPR$0  |
+---------+
| 101     |
+---------+
1 row selected (3.248 seconds)
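The two answers differ because untyped values are compared as text, character by character, so '1001' sorts before '101'. A minimal Python sketch of the same effect, using a made-up list of IDs rather than the real data:

```python
# Hypothetical sample of vehicle IDs as they arrive from a text file (all strings).
vehicle_ids = ["1001", "101", "7582"]

# Text comparison: '1001' < '101' because '0' < '1' at the third character.
min_as_text = min(vehicle_ids)

# Numeric comparison after casting each value to an integer.
min_as_int = min(int(v) for v in vehicle_ids)

print(min_as_text)  # 1001
print(min_as_int)   # 101
```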


Adding Schema

select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) as time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
    then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
    then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime
limit 5;
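The CASE expressions guard the casts: a field holding the literal text 'NULL' or an empty string becomes a null instead of causing a conversion error. A minimal Python sketch of the same idea follows. The helper name cast_or_none and the sample values are hypothetical.

```python
def cast_or_none(value, caster):
    """Return caster(value), or None for the 'NULL' and '' placeholders."""
    if value is None or value in ("NULL", ""):
        return None
    return caster(value)

# Hypothetical raw field values as they might arrive from the text file.
print(cast_or_none("1", int))       # 1
print(cast_or_none("NULL", int))    # None
print(cast_or_none("", float))      # None
print(cast_or_none("12.5", float))  # 12.5
```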


Adding Schema
• But how do you query that? How do you edit the query?
• Subselect? Works, but this is terrible for usability.

select min(vehicle_id) from (
  select
    cast(latitude as float) latitude,
    cast(longitude as float) longitude,
    cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
    cast(vehicle_id as int) as vehicle_id,
    cast(distance_along_trip as float) distance_along_trip,
    case when inferred_direction_id not in ('NULL', '')
      then cast(inferred_direction_id as int) end as inferred_direction_id,
    cast(inferred_phase as VARCHAR(20)) inferred_phase,
    cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
    cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
    case when next_scheduled_stop_distance not in ('NULL', '')
      then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
    cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
  from dfs.mta.bustime
  limit 5);


Adding Schema
• Create a view:

create or replace view bustime_vw as
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
    then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
    then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime;


Now we have types…

0: jdbc:drill:schema=dfs.mta> describe bustime_vw;
+-------------------------------+--------------------+--------------+
| COLUMN_NAME                   | DATA_TYPE          | IS_NULLABLE  |
+-------------------------------+--------------------+--------------+
| latitude                      | FLOAT              | YES          |
| longitude                     | FLOAT              | YES          |
| time_received                 | TIMESTAMP          | YES          |
| vehicle_id                    | INTEGER            | YES          |
| distance_along_trip           | FLOAT              | YES          |
| inferred_direction_id         | INTEGER            | YES          |
| inferred_phase                | CHARACTER VARYING  | YES          |
| inferred_route_id             | CHARACTER VARYING  | YES          |
| inferred_trip_id              | CHARACTER VARYING  | YES          |
| next_scheduled_stop_distance  | FLOAT              | YES          |
| next_scheduled_stop_id        | CHARACTER VARYING  | YES          |
+-------------------------------+--------------------+--------------+
11 rows selected (0.131 seconds)


…and simplified queries

> select min(vehicle_id) from bustime_vw;
+---------+
| EXPR$0  |
+---------+
| 101     |
+---------+
1 row selected (4.235 seconds)


But, sometimes we get bad data…

The NYC MTA does not, in fact, operate buses off the coast of Africa.


So update the view

Filter out the bad latitudes:

create or replace view bustime_vw as
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
    then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
    then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime
where latitude > 0.0;


After replacing the view

Better.


A Simple Aggregation - View

0: jdbc:drill:zk=local> select inferred_route_id, count(vehicle_id) layovers
  from bustime_vw
  where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
  group by inferred_route_id
  order by layovers desc
  limit 5;
+--------------------+-----------+
| inferred_route_id  | layovers  |
+--------------------+-----------+
| MTA NYCT_B6        | 7778      |
| MTA NYCT_B46       | 5751      |
| MTA NYCT_Q58       | 5276      |
| MTA NYCT_B41       | 4912      |
| MTA NYCT_Q46       | 4731      |
+--------------------+-----------+
5 rows selected (7.04 seconds)

SLOW!


A Better Format - CTAS to Parquet*

0: jdbc:drill:zk=local> create table bustime_pq as select * from bustime_vw;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 5411159                    |
+-----------+----------------------------+
1 row selected (48.69 seconds)

*Parquet is the default format for CTAS.


A Simple Aggregation - Parquet

0: jdbc:drill:zk=local> select inferred_route_id, count(vehicle_id) layovers
  from bustime_pq
  where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
  group by inferred_route_id
  order by layovers desc
  limit 5;
+--------------------+-----------+
| inferred_route_id  | layovers  |
+--------------------+-----------+
| MTA NYCT_B6        | 7778      |
| MTA NYCT_B46       | 5751      |
| MTA NYCT_Q58       | 5276      |
| MTA NYCT_B41       | 4912      |
| MTA NYCT_Q46       | 4731      |
+--------------------+-----------+
5 rows selected (1.959 seconds)

Better!


Checkpoint
• So far we have:
– Queried a delimited text file via the dfs storage plugin
– Used a workspace with some options to make query easier
– Added types to columns
– Created a view, filtering bad data
– Transformed the data to Parquet for faster query
• Parquet - columnar, compressed


Summary - Workflow for Delimited Text
– Drop a sample file in /tmp (local) or maprfs:///tmp or hdfs:///tmp
– Exploratory queries (SELECT *) via dfs.tmp workspace
– Create a workspace for your data pointing to a directory with the entire data set
– Understand the format and set options on inputFormat
• fieldDelimiter (multi-byte delimiters not yet supported)
• extractHeader, skipFirstLine
– Add types to columns with CAST
– Create a view over the typed columns
– Transform the data to Parquet for faster query
• Parquet - columnar, compressed


Q & A

Engage with us!
@mapr, maprtech, mapr-technologies