Drilling into Data with Apache Drill

Tomer Shiran (@tshiran), Apache Drill founder and PMC member, MapR VP Product
Jacques Nadeau (@intjesus), Apache Drill PMC Chair (VP, Apache Drill)


Apache Drill

• Open source SQL query engine for non-relational datastores
– JSON document model
– Columnar
• Key advantages:
– Query any non-relational datastore
– No overhead (creating and maintaining schemas, transforming data, …)
– Treat your data like a table even when it’s not
– Keep using the BI tools you love
– Scales from one laptop to 1000s of servers
– Great performance and scalability

Omni-SQL (“SQL-on-Everything”)

Drill: Omni-SQL

“Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.”

Andrew Brust

Any Non-Relational Datastore

• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores

Any Client

• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau
– Qlik
– MicroStrategy
– TIBCO Spotfire
– Excel
• Command line (Drill shell)
• Web and mobile apps
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …

Drill Integrates With What You Have

Achieving “End-to-End Performance”

Execute fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up

Iterate fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle

JSON Model, Columnar Speed

[Figure: data formats (JSON, BSON, Mongo; HBase, NoSQL; Parquet, Avro; CSV, TSV) placed on two axes: schema-less vs. fixed schema, and flat vs. complex]

RDBMS/SQL-on-Hadoop table:

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Apache Drill Provides the Best of Both Worlds

Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore

Even When Your Data Doesn’t

• Path based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs

Why? To Support the Changing Data Organization

Data Dev Circa 2000
1. Developer comes up with requirements
2. DBA defines tables
3. DBA defines indices
4. DBA defines FK relationships
5. Developer stores data
6. BI builds reports
7. Analyst views reports
8. DBA adds materialized views

Data Today
1. Developer builds app, defines schema, stores data
2. Analyst queries data
3. Data engineer fixes performance problems or fills functionality gaps

HOW DOES IT WORK?

Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible

[Figure: a Drillbit, a single process (daemon or CLI)]

Data Lake, More Like Data Maelstrom

[Figure: a mix of desktops and clustered servers: a Windows desktop, a Mac desktop, an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster (mongod), and a Cassandra cluster]

Run Drillbits Wherever; Whatever Your Data

[Figure: the same environment with a Drillbit running next to each HDFS, HBase, mongod, and Cassandra node, and on the Windows and Mac desktops]

Connect to Any Drillbit with ODBC, JDBC, C, Java, REST

1. User connects to Drillbit
2. That Drillbit becomes Foreman
– Foreman generates execution plan
– Cost-based query optimization & locality
3. Execution fragments are farmed to other Drillbits
4. Drillbits exchange data as necessary to guarantee relational algebra
5. Results are returned to user through Foreman

[Figure: User connected to a Drillbit (foreman), which works with two other Drillbits]

ANALYZING YELP DATA

1. DOWNLOAD AND INSTALL DRILL

Run Drill in Embedded Mode (drill-embedded)

$ tar xf apache-drill-1.0.0.tar.gz

$ cd apache-drill-1.0.0

$ bin/drill-embedded

> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
+----------------+----------------------------------+---------------+-------+
| yelping_since  | votes                            | review_count  | name  |
+----------------+----------------------------------+---------------+-------+
| 2012-02        | {"funny":1,"useful":5,"cool":0}  | 6             | Lee   |
+----------------+----------------------------------+---------------+-------+

• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode
• Web UI is available at localhost:8047
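The port that serves the Web UI also serves the REST interface mentioned earlier, so a script could submit the same query over HTTP. The sketch below only builds the request and does not send it. The `/query.json` path and the `queryType` and `query` fields are how I understand Drill's REST API; treat them as assumptions to check against the documentation for your Drill version. The helper name `build_query_request` is purely illustrative.

```python
import json
import urllib.request

# Assumption: an embedded Drillbit is listening on the default Web UI port.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying one SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        DRILL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1"
)

# With a Drillbit running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping the request construction separate from the network call makes the payload easy to inspect before pointing it at a real cluster.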

Review the Query Profile in the Web UI (localhost:8047)

Run Drill in Distributed Mode

$ zkServer start         # ZooKeeper maintains the list of drillbits in the cluster
$ bin/drillbit.sh start  # conf/drill-override.conf includes cluster name and ZK nodes
$ bin/drill-conf         # or bin/drill-localhost to skip ZK lookup

> SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;
+--------+---------+
| stars  | EXPR$1  |
+--------+---------+
| 1      | 110772  |
| 2      | 102737  |
| 3      | 163761  |
| 4      | 342143  |
| 5      | 406045  |
+--------+---------+
5 rows selected (3.739 seconds)

2. CONFIGURE DATASTORES (STORAGE PLUGINS)

Enable MongoDB Storage Plugin

Define Workspaces in the File Storage Plugin

3. EXPLORE THE DATA

The Data: Files

{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}

The Data: MongoDB Collections

$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin  (empty)
local  0.078GB
yelp   0.453GB
> use yelp
> db.users.findOne()
{
    "_id" : ObjectId("54566cdf3237149de181a92a"),
    "yelping_since" : "2012-02",
    "votes" : {
        "funny" : 1,
        "useful" : 5,
        "cool" : 0
    },
    "review_count" : 6,
    "name" : "Lee",
    "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
    "friends" : [ ]
}

Are There More 5-Star or 1-Star Reviews?

> SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;
+--------+---------+
| stars  | EXPR$1  |
+--------+---------+
| 1      | 110772  |
| 2      | 102737  |
| 3      | 163761  |
| 4      | 342143  |
| 5      | 406045  |
+--------+---------+
5 rows selected (3.739 seconds)
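For readers who want to see what the GROUP BY computes, here is a rough single-machine equivalent in Python. It assumes one JSON object per line, as in the review record shown earlier. The sample records and the function name `count_reviews_by_stars` are made up for illustration and are not part of the Yelp dataset or of Drill.

```python
import json
from collections import Counter


def count_reviews_by_stars(lines):
    """Count records per star rating, sorted by rating.

    Mirrors: SELECT stars, count(*) ... GROUP BY stars ORDER BY stars
    """
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[json.loads(line)["stars"]] += 1
    return sorted(counts.items())


# Hypothetical sample standing in for review.json.
sample_lines = [
    '{"stars": 5, "review_id": "r1"}',
    '{"stars": 1, "review_id": "r2"}',
    '{"stars": 5, "review_id": "r3"}',
    '{"stars": 4, "review_id": "r4"}',
]

for stars, count in count_reviews_by_stars(sample_lines):
    print(stars, count)
```

The point of the Drill version is that the same aggregation runs directly on the file, with no loading step and no script to maintain.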

Using Storage Plugins and Workspaces

> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json` LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;

In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.

Storage Plugin   Workspace   Table
dfs              Path        Path relative to workspace
mongo            Database    Collection
hive             Database    Table
hbase            Namespace   Table

Most Common User Names (MongoDB)

> SELECT name, count(*) AS users FROM mongo.yelp.users GROUP BY name ORDER BY users DESC LIMIT 10;
+------------+------------+
| name       | users      |
+------------+------------+
| David      | 2453       |
| John       | 2378       |
| Michael    | 2322       |
| Chris      | 2202       |
| Mike       | 2037       |
| Jennifer   | 1867       |
| Jessica    | 1463       |
| Jason      | 1457       |
| Michelle   | 1439       |
| Brian      | 1436       |
+------------+------------+

Cities with the Most Businesses

> SELECT state, city, count(*) AS businesses FROM dfs.demo.`/yelp/business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state      | city       | businesses  |
+------------+------------+-------------+
| NV         | Las Vegas  | 12021       |
| AZ         | Phoenix    | 7499        |
| AZ         | Scottsdale | 3605        |
| EDH        | Edinburgh  | 2804        |
| AZ         | Mesa       | 2041        |
| AZ         | Tempe      | 2025        |
| NV         | Henderson  | 1914        |
| AZ         | Chandler   | 1637        |
| WI         | Madison    | 1630        |
| AZ         | Glendale   | 1196        |
+------------+------------+-------------+

3. EXPLORING COMPLEX DATA

business.json (1)

{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,

business.json (2)

  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {
      "dessert": false, "latenight": false, "lunch": false,
      "dinner": true, "breakfast": false, "brunch": false
    },
  }
}

Which Places Are Open Right Now (22:00)?

> SELECT name, b.hours FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00' LIMIT 2;
+------------------------------+------------------------------------------------+
| name                         | hours                                          |
+------------------------------+------------------------------------------------+
| Chang Jiang Chinese Kitchen  | {"Saturday":{"close":"22:30","open":"11:00"}}  |
| Grand China Restaurant       | {"Saturday":{"close":"23:00","open":"11:00"}}  |
+------------------------------+------------------------------------------------+

It’s 10pm in Vegas and I Want Good Hummus!

> SELECT name, b.hours.Friday AS friday, categories FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2;
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| name                           | friday                            | categories                                                   |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| Olives                         | {"close":"22:30","open":"11:00"}  | ["Mediterranean","Restaurants"]                              |
| Marrakech Moroccan Restaurant  | {"close":"23:00","open":"17:30"}  | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"]  |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
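The filter logic in this query can be sketched in Python to make one detail visible: the opening and closing times are compared as "HH:MM" strings. Assuming Drill performs an ordinary string comparison here, a place that closes at "00:00" (midnight) would not satisfy `close > '22:00'` and would be left out. In the sketch below, plain list membership stands in for REPEATED_CONTAINS, the star ratings are invented for illustration, and the "Midnight Mezze" record is entirely hypothetical.

```python
def is_open_at(business, day, time):
    """True if open < time < close, comparing "HH:MM" strings as the query does."""
    hours = business.get("hours", {}).get(day)
    if hours is None:  # no hours listed for that day
        return False
    return hours["open"] < time < hours["close"]


def find_places(businesses, city, category, day, time, limit=2):
    """Filter by city, category, and opening hours; best-rated first."""
    matches = [
        b for b in businesses
        if b.get("city") == city
        and category in b.get("categories", [])  # stand-in for REPEATED_CONTAINS
        and is_open_at(b, day, time)
    ]
    matches.sort(key=lambda b: b["stars"], reverse=True)
    return [b["name"] for b in matches[:limit]]


# Hours and categories follow the slide; star values are hypothetical.
businesses = [
    {"name": "Olives", "city": "Las Vegas", "stars": 4.0,
     "categories": ["Mediterranean", "Restaurants"],
     "hours": {"Friday": {"open": "11:00", "close": "22:30"}}},
    {"name": "Marrakech Moroccan Restaurant", "city": "Las Vegas", "stars": 3.5,
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"],
     "hours": {"Friday": {"open": "17:30", "close": "23:00"}}},
    # Hypothetical record: closes at midnight, so "00:00" > "22:00" is False.
    {"name": "Midnight Mezze", "city": "Las Vegas", "stars": 5.0,
     "categories": ["Mediterranean"],
     "hours": {"Friday": {"open": "11:00", "close": "00:00"}}},
]

print(find_places(businesses, "Las Vegas", "Mediterranean", "Friday", "22:00"))
```

If midnight closings matter for a real analysis, the times would need to be converted to a comparable form first rather than compared as text.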

Flatten Repeated Values

> SELECT name, categories FROM dfs.demo.`yelp/business.json` LIMIT 3;
+-----------------------------+-------------------------------------------+
| name                        | categories                                |
+-----------------------------+-------------------------------------------+
| Eric Goldberg, MD           | ["Doctors","Health & Medical"]            |
| Pine Cone Restaurant        | ["Restaurants"]                           |
| Deforest Family Restaurant  | ["American (Traditional)","Restaurants"]  |
+-----------------------------+-------------------------------------------+

> SELECT name, FLATTEN(categories) AS categories FROM dfs.demo.`yelp/business.json` LIMIT 5;
+-----------------------------+-------------------------+
| name                        | categories              |
+-----------------------------+-------------------------+
| Eric Goldberg, MD           | Doctors                 |
| Eric Goldberg, MD           | Health & Medical        |
| Pine Cone Restaurant        | Restaurants             |
| Deforest Family Restaurant  | American (Traditional)  |
| Deforest Family Restaurant  | Restaurants             |
+-----------------------------+-------------------------+
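As a mental model, FLATTEN turns one row holding an array into one row per array element, repeating the other columns. A minimal Python sketch of that behavior, using the three businesses from the slide, might look like the following. The `flatten` helper is illustrative only and says nothing about how Drill implements the operator internally.

```python
def flatten(rows, field):
    """Yield one copy of each row per element of row[field]."""
    for row in rows:
        for value in row[field]:
            # Copy the row, replacing the array with a single element.
            yield {**row, field: value}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

for row in flatten(businesses, "categories"):
    print(row["name"], "|", row["categories"])
```

In this sketch a row with an empty array produces no output rows, so counting after flattening counts category memberships rather than businesses.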

Most and Least Common Business Categories

> SELECT category, count(*) AS businesses FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.demo.`yelp/business.json`) c GROUP BY category ORDER BY businesses DESC;
+-----------------------------------+-------------+
| category                          | businesses  |
+-----------------------------------+-------------+
| Restaurants                       | 14303       |
| Shopping                          | 6428        |
…
| Australian                        | 1           |
| Boat Dealers                      | 1           |
| Firewood                          | 1           |
+-----------------------------------+-------------+
715 rows selected (3.439 seconds)

> SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
+--------------------+--------------------------------------------------------------------------+
| name               | categories                                                               |
+--------------------+--------------------------------------------------------------------------+
| The Australian AZ  | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"]  |
+--------------------+--------------------------------------------------------------------------+

4. LEVERAGING VIEWS

Create a View for Name-Gender Mapping

> CREATE VIEW dfs.tmp.`names` AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;

> USE dfs.tmp;
> CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;

> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name       | gender     |
+------------+------------+
| John       | Male       |
+------------+------------+

[Figure: names.csv, with the fields read as columns[0] and columns[4] marked]
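The view above relies on the CSV file being read as rows with a single array named columns, from which positions 0 and 4 are picked and given names. A small Python sketch of the same projection is shown here. The contents of names.csv are not shown on the slide, so the sample rows and their middle fields are placeholders, and `names_view` is a hypothetical helper.

```python
import csv
import io


def names_view(csv_text):
    """Project field 0 as name and field 4 as gender from headerless CSV text."""
    for columns in csv.reader(io.StringIO(csv_text)):
        yield {"name": columns[0], "gender": columns[4]}


# Placeholder data: only positions 0 and 4 are meaningful in this sketch.
sample_csv = "John,a,b,c,Male\nJennifer,a,b,c,Female\n"

rows = list(names_view(sample_csv))
print([row for row in rows if row["name"] == "John"])
```

Naming the fields once in a view means later queries can refer to name and gender instead of repeating array positions.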

Most Common Names (and their Genders) on Yelp

> SELECT u.name, n.gender, count(*) AS number FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY u.name, n.gender ORDER BY number DESC LIMIT 10;
+------------+------------+------------+
| name       | gender     | number     |
+------------+------------+------------+
| David      | Male       | 2453       |
| John       | Male       | 2378       |
| Michael    | Male       | 2322       |
| Chris      | Unknown    | 2202       |
| Mike       | Male       | 2037       |
| Jennifer   | Female     | 1867       |
| Jessica    | Female     | 1463       |
| Jason      | Male       | 1457       |
| Michelle   | Female     | 1439       |
| Brian      | Male       | 1436       |
+------------+------------+------------+

Who Rates Higher – Men or Women?

> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY n.gender;
+------------+------------+------------+
| gender     | users      | stars      |
+------------+------------+------------+
| Female     | 103684     | 3.77       |
| Male       | 97430      | 3.696      |
| Unknown    | 18409      | 3.727      |
+------------+------------+------------+

Who Writes Longer Reviews – Men or Women?

> SELECT n.gender, round(avg(length(r.text))) AS review_length FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name AND r.user_id = u.user_id GROUP BY n.gender;
+------------+---------------+
| gender     | review_length |
+------------+---------------+
| Male       | 665           |
| Female     | 730           |
| Unknown    | 711           |
+------------+---------------+

It takes a 3-way join to find out…
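To show what the three-way join does, here is a small in-memory sketch in Python: reviews (from the JSON file) are matched to users (from MongoDB) by user_id, users are matched to the name-gender view by name, and review lengths are averaged per gender. It assumes each name appears at most once in the view. All sample records and the function name are hypothetical, and Drill's own join strategy may differ.

```python
from collections import defaultdict


def average_review_length_by_gender(reviews, users, names):
    """Join reviews -> users -> names and average len(text) per gender."""
    # names view: name -> gender (assumes one row per name)
    gender_by_name = {n["name"]: n["gender"] for n in names}
    # users joined to names: user_id -> gender (inner join drops unknown names)
    gender_by_user_id = {
        u["user_id"]: gender_by_name[u["name"]]
        for u in users
        if u["name"] in gender_by_name
    }
    lengths = defaultdict(list)
    for review in reviews:
        gender = gender_by_user_id.get(review["user_id"])
        if gender is not None:  # inner join drops reviews without a match
            lengths[gender].append(len(review["text"]))
    return {g: round(sum(v) / len(v)) for g, v in lengths.items()}


# Hypothetical sample data standing in for the three sources.
names = [{"name": "John", "gender": "Male"}, {"name": "Jennifer", "gender": "Female"}]
users = [
    {"user_id": "u1", "name": "John"},
    {"user_id": "u2", "name": "Jennifer"},
    {"user_id": "u3", "name": "Zed"},  # not in the names view
]
reviews = [
    {"user_id": "u1", "text": "good"},
    {"user_id": "u1", "text": "great food"},
    {"user_id": "u2", "text": "nice"},
    {"user_id": "u3", "text": "ok"},
]

print(average_review_length_by_gender(reviews, users, names))
```

In the Drill query the three inputs live in a JSON file, a MongoDB collection, and a view over a CSV file, yet the join is written as if they were ordinary tables.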

Thank You!

• Download at drill.apache.org
• Tweet: @ApacheDrill

http://drill.apache.org/



