vince-gonzalez
© 2016 MapR Technologies
Drill Workshop
Working with Delimited Data in Drill 1.6.0
May 2016
APACHE DRILL
• Schema-free, scale-out query engine for Hadoop and NoSQL
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Running Drill takes 10 minutes (or less)
DOWNLOAD & step by step: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes
EXTRACT
$ tar xf apache-drill-1.6.0.tar.gz
$ cd apache-drill-1.6.0
RUN
$ bin/drill-embedded
QUERY (in SQL format)
> SELECT full_name, position_title, salary FROM cp.`employee.json` LIMIT 1;
+--------------+----------------+------------+
| full_name    | position_title | salary     |
+--------------+----------------+------------+
| Sheri Nowmer | President      | 80000.0    |
+--------------+----------------+------------+
1 row selected (0.417 seconds)
Drill’s Data Model is Flexible
[Figure: data formats placed on two axes. Axis labels: Fixed schema / Dynamic schema, Flat / Complex, Flexibility. Formats shown: JSON, BSON, HBase, Parquet, Avro, CSV, TSV.]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
Drill Enables ‘SQL on Everything’
SELECT * FROM dfs.yelp.`business.json`
• Storage plugin instance (dfs)
– DFS (Text, Parquet, JSON)
– HBase/MapR-DB
– Hive Metastore/HCatalog
– Easy API to go beyond Hadoop
• Workspace (yelp)
– Sub-directory
– HBase namespace
– Hive database
• Table (`business.json`)
– Pathnames
– Hive table
– HBase table
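To make the three-part naming concrete, here is a minimal Python sketch (not Drill code) that splits a name such as dfs.yelp.`business.json` into storage plugin, workspace, and table. The function name split_drill_name is hypothetical, and the sketch assumes exactly two unquoted parts followed by a table name that may be wrapped in backticks; it does not handle two-part names such as hbase.users.

```python
def split_drill_name(name):
    """Split 'plugin.workspace.`table`' into its three parts.

    Illustrative only: splits on the first two dots, so dots inside the
    table name (for example a file extension) are left alone.
    """
    plugin, workspace, table = name.split(".", 2)
    return plugin, workspace, table.strip("`")


print(split_drill_name("dfs.yelp.`business.json`"))
```

Splitting on only the first two dots is what keeps the file extension attached to the table name.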
Schema Free?
• But you can use schema if you want. There are several benefits to applying schema:
– Better comparisons
– Better compression when using Parquet
– More efficient in-memory representation
• Drill can also use schema provided by the Hive metastore
– And use 100+ Hive SerDes to read the data.
Introduce external data sources to Drill
SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;
SELECT * FROM dfs.yelp.`review.json` LIMIT 1;
USE dfs.yelp;
SELECT * FROM `review.json` LIMIT 1;
SELECT * FROM hbase.users LIMIT 1;

Storage Plugin Provider   Workspace   Table
Currently supported providers:
files                     Path        Path relative to workspace
mongo                     Database    Collection
hive                      Database    Table
hbase                     Namespace   Table
Future or experimental providers:
cassandra                 Keyspace    Table
jdbc                      Database    Table
Let's Do Some Stuff
The Dataset
• NYC MTA Bus Times
– Covers three months of 2014
– One day per file
• 92 TSV files, over 400 million rows
• We will work with one file (the entire data set, time permitting)
– 5.4 million rows
– 838MB
• Your cluster already has the data on it.
Your Cluster
• Three Drillbits
• URL: http://<public_ip_here>:8047
• SSH: ec2-user@<public_ip_here>
• Make sure you use the private key provided.
• ssh -i drill_workshop_private_key.openssh ec2-user@<public_ip>
• putty key available
Why Drill for delimited data?
• Too large to manipulate in Excel
• Maybe you want to join the data with another source?
• Maybe you want a code-free ELT workflow?
• Leave the original data untouched
Our Goal
• Walk through exploration of a single data set
• Prepare the data for low-latency query
• Sub-tasks
– Initial Exploration
– Remove rows with invalid data
– Apply types to the data
– Transform to a more storage/performance efficient format
– Query it!
Step 0: Get Drill, get the Data
Get Drill 1.6.0 from the closest Apache Mirror:
curl -LO 'http://www.apache.org/dyn/closer.lua?filename=drill/drill-1.6.0/apache-drill-1.6.0.tar.gz&action=download'
Get the dataset from the MTA's S3 bucket:
curl -LO http://s3.amazonaws.com/MTABusTime/AppQuest3/MTA-Bus-Time_.2014-10-31.txt.xz
Step 1: Unpack Drill and Data
• Unpack Drill (suggested: /opt)
• Unpack the Data (suggested: /tmp)
Launch sqlline
• Distributed:
– sqlline -u jdbc:drill:schema=dfs.mta
• Embedded:
– $DRILL_HOME/bin/drill-embedded
• use dfs;
Explore the Dataset
Usually, first you put the files somewhere. Check that Drill can see them.
show files in dfs.tmp;
Explore the Dataset
See if Drill can query the .txt file:
use dfs.tmp;
select * from `MTA-Bus-Time_.2014-10-31.txt` limit 10;
Explore the Dataset
What just happened there?
• Drill has a dfs storage plugin
• Drill comes with a workspace called tmp that points to /tmp (maprfs:///tmp on a MapR cluster, /tmp on your laptop)
• File has a .txt extension, which Drill does not understand, so the query fails.
• Your cluster is pre-configured with a new inputFormat, called tsvh - tab separated values, with header.
• Let's rename the file, adding an extension .tsvh, then try again.
Explore the Dataset
See if Drill can query the file with the new .tsvh extension:
use dfs.tmp;
select * from `MTA-Bus-Time_.2014-10-31.txt.tsvh` limit 10;
Explore the Dataset
What just happened there?
• Drill has a dfs storage plugin
• Drill comes with a workspace called tmp that points to /tmp (maprfs:///tmp on a MapR cluster, /tmp on your laptop)
• Your cluster is pre-configured with a new inputFormat, called tsvh - tab separated values, with header
• Drill used the storage plugin configuration to know how to read the data, and the inputFormat to figure out how to parse the text based on the extension
• Drill used the first line of each file to generate named columns
• Visit your Drillbit's UI, navigate to storage, then click 'Update' next to dfs to view the plugin configuration.
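The extension-based choice of format can be pictured with a small Python sketch. This is not how Drill is implemented; FORMATS is a hypothetical, simplified stand-in for the formats section of the dfs plugin, and format_for is a made-up helper name.

```python
import os

# Hypothetical, simplified stand-in for the formats known to a dfs plugin.
FORMATS = {
    "csv": {"extensions": ["csv"], "delimiter": ","},
    "tsvh": {"extensions": ["tsvh"], "delimiter": "\t", "extractHeader": True},
}


def format_for(filename):
    """Return the name of the format whose extension list matches, or None."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    for name, cfg in FORMATS.items():
        if ext in cfg["extensions"]:
            return name
    return None


print(format_for("MTA-Bus-Time_.2014-10-31.txt"))       # no matching format
print(format_for("MTA-Bus-Time_.2014-10-31.txt.tsvh"))  # matches tsvh
```

Only the final extension is considered in this sketch, which is why appending .tsvh to the original file name is enough to change the outcome.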
What we know so far
• Place your files somewhere Drill can find them.
– dfs.tmp is not a bad place to start (it's always there)
• Name your files so Drill can read them
– Default dfs plugin inputFormats provide a list of known extensions
• Create your own inputFormats, and include options if you like
– extractHeader, skipFirstLine, fieldDelimiter
• Gives rise to custom formats, such as:
– CSV without headers (no options)
– CSV with unusable headers (skipFirstLine)
– Pipe-separated with headers (extractHeader, fieldDelimiter)
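The three custom formats listed above could be written down roughly as follows. This Python sketch only builds and prints JSON; the format names csvskip and psvh and the helper text_format are made up for illustration, and the key spelling ("delimiter", "extractHeader", "skipFirstLine") follows this workshop's tsvh example and option list, so it should be checked against the plugin configuration on your own Drillbit.

```python
import json


def text_format(extensions, delimiter, **options):
    """Build one text-format entry; extra options are added as given."""
    cfg = {"type": "text", "extensions": extensions, "delimiter": delimiter}
    cfg.update(options)
    return cfg


formats = {
    # CSV without headers: no extra options.
    "csv": text_format(["csv"], ","),
    # CSV with unusable headers: drop the first line (hypothetical name).
    "csvskip": text_format(["csvskip"], ",", skipFirstLine=True),
    # Pipe-separated with headers (hypothetical name).
    "psvh": text_format(["psvh"], "|", extractHeader=True),
}

print(json.dumps(formats, indent=2))
```

Giving each variant its own extension is the design choice that lets the file name alone select the right parsing options.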
Multiple Files
• Many datasets have more than one file.
• To group files together, it's convenient to place them in a workspace.
• Navigate to your drillbit UI, let's look at the pre-created workspace.
The MTA Workspace
"workspaces": {
  ...
  "mta": {
    "location": "/user/ec2-user/data/nyc/mta",
    "writable": true,
    "defaultInputFormat": "tsvh"
  }
},
The TSVH format
"formats": {
  ...
  "tsvh": {
    "type": "text",
    "extensions": [ "tsvh" ],
    "extractHeader": true,
    "delimiter": "\t"
  }
},
Checkpoint
• So far we have:
– Queried a delimited text file via the dfs storage plugin
– Used a workspace with some options to make query easier
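As a rough picture of what the tsvh options used so far do (tab delimiter, first line taken as column names), here is a small Python sketch with the standard csv module. It is an analogy, not Drill's reader; the two sample rows are made up and read_tsvh is a hypothetical helper.

```python
import csv
import io

# Two made-up rows in the shape of the bus data; not real records.
SAMPLE = (
    "vehicle_id\tinferred_phase\n"
    "101\tLAYOVER_BEFORE\n"
    "1001\tLAYOVER_DURING\n"
)


def read_tsvh(text):
    """Parse tab-separated text, using the first line as column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_tsvh(SAMPLE)
print(rows[0])
print(type(rows[0]["vehicle_id"]).__name__)  # every value is still text
```

The header gives the columns names, but nothing here assigns them types: every value comes back as a string.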
Schema
• SQL has types, but so far, our data does not.
use dfs.mta;
describe bustime;
Who needs schema?
Without Schema
> select min(vehicle_id) from bustime;
+---------+
| EXPR$0  |
+---------+
| 1001    |
+---------+
1 row selected (3.238 seconds)
With Schema
> select min(cast(vehicle_id as INT)) from bustime;
+---------+
| EXPR$0  |
+---------+
| 101     |
+---------+
1 row selected (3.248 seconds)
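The two different minimums come from comparing text versus comparing numbers. A minimal Python illustration, using a made-up handful of vehicle_id values (Python's string ordering stands in for the untyped comparison; it is an analogy, not Drill's implementation):

```python
# Made-up sample of vehicle_id values as they arrive from a text file: strings.
vehicle_ids = ["1001", "101", "2500", "999"]

# Compared as text, character by character: "1001" sorts before "101"
# because the third character "0" is smaller than "1".
print(min(vehicle_ids))

# Compared as integers after an explicit conversion.
print(min(int(v) for v in vehicle_ids))
```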
Adding Schema
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) as time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
    then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
    then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime
limit 5;
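The pattern in this query (plain casts for most columns, and a guarded cast that yields null when a field holds the text 'NULL' or an empty string) can be sketched in Python for a single row. The helper names and the sample values are made up, only a few of the columns are shown, and the strptime pattern is an assumed equivalent of the 'YYYY-MM-dd HH:mm:ss' format string.

```python
from datetime import datetime

MISSING = {"NULL", ""}  # the two markers the CASE expressions test for


def to_int_or_none(value):
    """Guarded cast: None for a missing marker, otherwise an int."""
    return None if value in MISSING else int(value)


def to_float_or_none(value):
    """Guarded cast: None for a missing marker, otherwise a float."""
    return None if value in MISSING else float(value)


def apply_types(row):
    """Convert a row of strings into typed values (subset of columns)."""
    return {
        "latitude": float(row["latitude"]),
        "time_received": datetime.strptime(row["time_received"], "%Y-%m-%d %H:%M:%S"),
        "vehicle_id": int(row["vehicle_id"]),
        "inferred_direction_id": to_int_or_none(row["inferred_direction_id"]),
        "next_scheduled_stop_distance": to_float_or_none(row["next_scheduled_stop_distance"]),
    }


raw = {  # made-up values, all strings as read from the text file
    "latitude": "40.7",
    "time_received": "2014-10-31 00:00:01",
    "vehicle_id": "101",
    "inferred_direction_id": "NULL",
    "next_scheduled_stop_distance": "",
}
typed = apply_types(raw)
print(typed)
```

Without the guard, converting the text 'NULL' to a number would raise an error, which is the failure the CASE expressions avoid.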
Adding Schema
• But how do you query that? How do you edit the query?
• Subselect? Works, but this is terrible for usability.
select min(vehicle_id) from (
  select
    cast(latitude as float) latitude,
    cast(longitude as float) longitude,
    cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
    cast(vehicle_id as int) as vehicle_id,
    cast(distance_along_trip as float) distance_along_trip,
    case when inferred_direction_id not in ('NULL', '')
      then cast(inferred_direction_id as int) end as inferred_direction_id,
    cast(inferred_phase as VARCHAR(20)) inferred_phase,
    cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
    cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
    case when next_scheduled_stop_distance not in ('NULL', '')
      then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
    cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
  from dfs.mta.bustime);
Adding Schema
• Create a view:
create or replace view bustime_vw as
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
    then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
    then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime;
Now we have types…
0: jdbc:drill:schema=dfs.mta> describe bustime_vw;
+-------------------------------+--------------------+--------------+
| COLUMN_NAME                   | DATA_TYPE          | IS_NULLABLE  |
+-------------------------------+--------------------+--------------+
| latitude                      | FLOAT              | YES          |
| longitude                     | FLOAT              | YES          |
| time_received                 | TIMESTAMP          | YES          |
| vehicle_id                    | INTEGER            | YES          |
| distance_along_trip           | FLOAT              | YES          |
| inferred_direction_id         | INTEGER            | YES          |
| inferred_phase                | CHARACTER VARYING  | YES          |
| inferred_route_id             | CHARACTER VARYING  | YES          |
| inferred_trip_id              | CHARACTER VARYING  | YES          |
| next_scheduled_stop_distance  | FLOAT              | YES          |
| next_scheduled_stop_id        | CHARACTER VARYING  | YES          |
+-------------------------------+--------------------+--------------+
11 rows selected (0.131 seconds)
…and simplified queries
> select min(vehicle_id) from bustime_vw;
+---------+
| EXPR$0  |
+---------+
| 101     |
+---------+
1 row selected (4.235 seconds)
But, sometimes we get bad data…
The NYC MTA does not, in fact, operate buses off the coast of Africa.
So update the view
Filter out the bad latitudes:
create or replace view bustime_vw as
select
  cast(latitude as float) latitude,
  cast(longitude as float) longitude,
  cast(to_timestamp(time_received, 'YYYY-MM-dd HH:mm:ss') as timestamp) time_received,
  cast(vehicle_id as int) as vehicle_id,
  cast(distance_along_trip as float) distance_along_trip,
  case when inferred_direction_id not in ('NULL', '')
    then cast(inferred_direction_id as int) end as inferred_direction_id,
  cast(inferred_phase as VARCHAR(20)) inferred_phase,
  cast(inferred_route_id as VARCHAR(20)) inferred_route_id,
  cast(inferred_trip_id as VARCHAR(30)) inferred_trip_id,
  case when next_scheduled_stop_distance not in ('NULL', '')
    then cast(next_scheduled_stop_distance as float) end next_scheduled_stop_distance,
  cast(next_scheduled_stop_id as VARCHAR(20)) as next_scheduled_stop_id
from dfs.mta.bustime
where latitude > 0.0;
After replacing the view
Better.
A Simple Aggregation - View
0: jdbc:drill:zk=local> select inferred_route_id, count(vehicle_id) layovers
  from bustime_vw
  where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
  group by inferred_route_id
  order by layovers desc
  limit 5;
+--------------------+-----------+
| inferred_route_id  | layovers  |
+--------------------+-----------+
| MTA NYCT_B6        | 7778      |
| MTA NYCT_B46       | 5751      |
| MTA NYCT_Q58       | 5276      |
| MTA NYCT_B41       | 4912      |
| MTA NYCT_Q46       | 4731      |
+--------------------+-----------+
5 rows selected (7.04 seconds)
SLOW!
A Better Format - CTAS to Parquet*
0: jdbc:drill:zk=local> create table bustime_pq as select * from bustime_vw;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 5411159                    |
+-----------+----------------------------+
1 row selected (48.69 seconds)
*Parquet is the default format for CTAS.
A Simple Aggregation - Parquet
0: jdbc:drill:zk=local> select inferred_route_id, count(vehicle_id) layovers
  from bustime_pq
  where inferred_phase in ('LAYOVER_BEFORE', 'LAYOVER_DURING')
  group by inferred_route_id
  order by layovers desc
  limit 5;
+--------------------+-----------+
| inferred_route_id  | layovers  |
+--------------------+-----------+
| MTA NYCT_B6        | 7778      |
| MTA NYCT_B46       | 5751      |
| MTA NYCT_Q58       | 5276      |
| MTA NYCT_B41       | 4912      |
| MTA NYCT_Q46       | 4731      |
+--------------------+-----------+
5 rows selected (1.959 seconds)
Better!
Checkpoint
• So far we have:
– Queried a delimited text file via the dfs storage plugin
– Used a workspace with some options to make query easier
– Added types to columns
– Created a view, filtering bad data
– Transformed the data to parquet for faster query
• Parquet - columnar, compressed
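One way to picture why a columnar layout suits the layover query: the query names only three columns, and a column-oriented store keeps each column together so the others need not be touched. The Python toy below is a sketch of that idea under made-up data (the phase value OTHER_PHASE is a placeholder); it says nothing about Parquet's actual file layout or compression.

```python
from collections import Counter

# Made-up rows standing in for the typed bus data.
rows = [
    {"inferred_route_id": "MTA NYCT_B6", "inferred_phase": "LAYOVER_DURING", "vehicle_id": 101, "latitude": 40.6},
    {"inferred_route_id": "MTA NYCT_B6", "inferred_phase": "LAYOVER_BEFORE", "vehicle_id": 102, "latitude": 40.6},
    {"inferred_route_id": "MTA NYCT_B46", "inferred_phase": "LAYOVER_DURING", "vehicle_id": 103, "latitude": 40.7},
    {"inferred_route_id": "MTA NYCT_B46", "inferred_phase": "OTHER_PHASE", "vehicle_id": 104, "latitude": 40.7},
]

# Column-oriented view of the same data: one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def layovers_by_route(cols, limit=5):
    """Count layover rows per route, reading only the three columns needed."""
    counts = Counter(
        route
        for route, phase, vehicle in zip(
            cols["inferred_route_id"], cols["inferred_phase"], cols["vehicle_id"]
        )
        if phase in ("LAYOVER_BEFORE", "LAYOVER_DURING") and vehicle is not None
    )
    return counts.most_common(limit)


print(layovers_by_route(columns))
```

The latitude column is never read by layovers_by_route; in a row-oriented text file it would have to be scanned past on every line.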
Summary - Workflow for Delimited Text
– Drop a sample file in /tmp (local) or maprfs:///tmp or hdfs:///tmp
– Exploratory queries (SELECT *) via dfs.tmp workspace
– Create a workspace for your data pointing to a directory with the entire data set
– Understand the format and set options on inputFormat
• fieldDelimiter (multi-byte delimiters not yet supported)
• extractHeader, skipFirstLine
– Add types to columns with CAST
– Create a view over the typed columns
– Transform the data to parquet for faster query
• Parquet - columnar, compressed
Q & A
Engage with us!
@mapr, maprtech, MapR, mapr-technologies