42
Drilling into Data with Apache Drill Tomer Shiran, Apache Drill Founder and PMC Member Jacques Nadeau, Apache Drill PMC Chair

Drilling into Data with Apache Drill

Embed Size (px)

Citation preview

Page 1: Drilling into Data with Apache Drill

Drilling into Data with Apache DrillTomer Shiran, Apache Drill Founder and PMC MemberJacques Nadeau, Apache Drill PMC Chair

Page 2: Drilling into Data with Apache Drill

Tomer Shiran Jacques [email protected] [email protected]

@tshiran @intjesus

Drill founder and PMC MemberMapR VP Product

Drill PMC Chair (VP, Apache Drill)

Page 3: Drilling into Data with Apache Drill

Apache Drill

• Open source SQL query engine for non-relational datastores– JSON document model– Columnar

• Key advantages:– Query any non-relational datastore– No overhead (creating and maintaining schemas, transforming

data, …)– Treat your data like a table even when it’s not– Keep using the BI tools you love– Scales from one laptop to 1000s of servers– Great performance and scalability

Page 4: Drilling into Data with Apache Drill

Omni-SQL (“SQL-on-Everything”)

Drill: Omni-SQLWhereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.

Andrew Brust,

“”

Page 5: Drilling into Data with Apache Drill

Any Non-Relational Datastore

• File systems– Traditional: Local files and NAS– Hadoop: HDFS and MapR-FS– Cloud storage: Amazon S3, Google

Cloud Storage, Azure Blob Storage

• NoSQL databases– MongoDB– HBase– MapR-DB– Hive

• And you can add new datastores

Any Client• Multiple interfaces: ODBC, JDBC, REST, C,

Java• BI tools

– Tableau– Qlik– MicroStrategy– TIBCO Spotfire– Excel

• Command line (Drill shell)• Web and mobile apps

– Many JSON-powered chart libraries (see D3.js)

• SAS, R, …

Drill Integrates With What You Have

Page 6: Drilling into Data with Apache Drill

Achieving “End-to-End Performance”

Execute fast• Standard SQL• Read data fast• Leverage columnar

encodings and execution

• Execute operations quickly

• Scale out, not up

Iterate fast• Work without prep• Decentralize data

management• In-situ security• Explore + query• Access multiple

sources• Avoid the ETL rinse

cycle

Page 7: Drilling into Data with Apache Drill

JSON Model, Columnar Speed

JSONBSONMongo

HBaseNoSQL

ParquetAvro

CSVTSV

Schema-lessFixed

schema

Flat

Complex

Name Gender Age

Michael M 6

Jennifer F 3

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos}{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC}

RDBMS/SQL-on-Hadoop table

Apache Drill table

Page 8: Drilling into Data with Apache Drill

Apache Drill Provides the Best of Both Worlds

Acts Like a Database• ANSI SQL: SELECT, FROM, WHERE,

JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME

• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.

• Subqueries, scalar subqueries, partition pruning, CTE

• Data warehouse offload• Tableau, ODBC, JDBC• TPC-H & TPC-DS-like workloads• Supports Hive SerDes• Supports Hive UDFs• Supports Hive Metastore

Even When Your Data Doesn’t

• Path based queries and wildcards– select * from /my/logs/– select * from /revenue/*/q2

• Modern data types– Map, Array, Any

• Complex Functions and Relational Operators– FLATTEN, kvgen, convert_from,

convert_to, repeated_count, etc

• JSON Sensor analytics• Complex data analysis• Alternative DSLs

Page 9: Drilling into Data with Apache Drill

Why? To Support the Changing Data Organization

Data Dev Circa 20001. Developer comes up with

requirements2. DBA defines tables3. DBA defines indices4. DBA defines FK relationships5. Developer stores data6. BI builds reports7. Analyst views reports8. DBA adds materialized

views

Data Today1. Developer builds app, defines

schema, stores data2. Analyst queries data3. Data engineer fixes

performance problems or fills functionality gaps

Page 10: Drilling into Data with Apache Drill

HOW DOES IT WORK?

Page 11: Drilling into Data with Apache Drill

Everything Starts With a Drillbit…• High performance query executor• In-memory columnar execution• Directly interacts with data, acquiring

knowledge as it reads• Built to leverage large amounts of

memory• Networked or not• Exposes ODBC, JDBC, REST• Built-in Web UI and CLI• Extensible

Drillbit

Single process (daemon or CLI)

Page 12: Drilling into Data with Apache Drill

Data Lake, More Like Data Maelstrom

HDFS HDFSmongod mongod

HDFS HDFS

HBase HBaseCassandra Cassandra

HDFS

HDFS

HBaseWindows Desktop

Mac Desktop

HBase & HDFS Cluster

HDFS ClusterMongoDB Cluster

Cassandra Cluster

DesktopClustered Servers

Page 13: Drilling into Data with Apache Drill

Run Drillbits Wherever; Whatever Your Data

Drillbit

HDFS HDFSmongod mongod

HDFS HDFS

HBase HBase

Drillbit

DrillbitDrillbitDrillbit Drillbit

Cassandra Cassandra

Drillbit Drillbit

HDFS

HDFS

HBase

Drillbit

Drillbit

Windows Desktop

Drillbit

Mac Desktop

Drillbit

Page 14: Drilling into Data with Apache Drill

Connect to Any Drillbit with ODBC, JDBC, C, Java, REST

1. User connects to Drillbit2. That Drillbit becomes Foreman

– Foreman generates execution plan – Cost-based query optimization &

locality

3. Execution fragments are farmed to other Drillbits

4. Drillbits exchange data as necessary to guarantee relational algebra

5. Results are returned to user through Foreman

Drillbit

User

Drillbit

Drillbit(foreman)

Page 15: Drilling into Data with Apache Drill

ANALYZING YELP DATA

Page 16: Drilling into Data with Apache Drill

1. DOWNLOAD AND INSTALL DRILL

Page 17: Drilling into Data with Apache Drill

Run Drill in Embedded Mode (drill-embedded)

$ tar xf apache-drill-1.0.0.tar.gz

$ cd apache-drill-1.0.0

$ bin/drill-embedded

> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;+----------------+----------------------------------+---------------+-------+| yelping_since | votes | review_count | name |+----------------+----------------------------------+---------------+-------+| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee |+----------------+----------------------------------+---------------+-------+

• drillbit (Drill daemon) starts automatically in embedded mode• No ZooKeeper in embedded mode• Web UI is available at localhost:8047

Page 18: Drilling into Data with Apache Drill

Review the Query Profile in the Web UI (localhost:8047)

Page 19: Drilling into Data with Apache Drill

Run Drill in Distributed Mode$ zkServer start # ZooKeeper maintains the list of drillbits in the cluster

$ bin/drillbit.sh start # conf/drill-override.conf includes cluster name and ZK nodes

$ bin/drill-conf # or bin/drill-localhost to skip ZK lookup

> SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;+--------+---------+| stars | EXPR$1 |+--------+---------+| 1 | 110772 || 2 | 102737 || 3 | 163761 || 4 | 342143 || 5 | 406045 |+--------+---------+5 rows selected (3.739 seconds)

Page 20: Drilling into Data with Apache Drill

2. CONFIGURE DATASTORES (STORAGE PLUGINS)

Page 21: Drilling into Data with Apache Drill

Enable MongoDB Storage Plugin

Page 22: Drilling into Data with Apache Drill

Define Workspaces in the File Storage Plugin

Page 23: Drilling into Data with Apache Drill

3. EXPLORE THE DATA

Page 24: Drilling into Data with Apache Drill

The Data: Files

{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}

Page 25: Drilling into Data with Apache Drill

The Data: MongoDB Collections$ mongoMongoDB shell version: 2.6.5> show databases;admin (empty)local 0.078GByelp 0.453GB> use yelp> db.users.findOne(){

"_id" : ObjectId("54566cdf3237149de181a92a"),"yelping_since" : "2012-02","votes" : {

"funny" : 1,"useful" : 5,"cool" : 0

},"review_count" : 6,"name" : "Lee","user_id" : "qtrmBGNqCvupHMHL_bKFgQ","friends" : [ ]

}

Page 26: Drilling into Data with Apache Drill

Are There More 5-Star or 1-Star Reviews?> SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;+--------+---------+| stars | EXPR$1 |+--------+---------+| 1 | 110772 || 2 | 102737 || 3 | 163761 || 4 | 342143 || 5 | 406045 |+--------+---------+5 rows selected (3.739 seconds)

Page 27: Drilling into Data with Apache Drill

Using Storage Plugins and Workspaces

> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json` LIMIT 1;> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;> SELECT * FROM mongo.yelp.users LIMIT 1;> USE mongo.yelp;> SELECT * FROM users LIMIT 1;

Storage pluginWorkspace

Path relative to workspace

Storage Plugin Workspace Table

dfs Path Path relative to workspace

mongo Database Collection

hive Database Table

hbase Namespace Table

Page 28: Drilling into Data with Apache Drill

Most Common User Names (MongoDB)> SELECT name, count(*) AS users FROM mongo.yelp.users GROUP BY name ORDER BY users DESC LIMIT 10;+------------+------------+| name | users |+------------+------------+| David | 2453 || John | 2378 || Michael | 2322 || Chris | 2202 || Mike | 2037 || Jennifer | 1867 || Jessica | 1463 || Jason | 1457 || Michelle | 1439 || Brian | 1436 |+------------+------------+

Page 29: Drilling into Data with Apache Drill

Cities with the Most Businesses> SELECT state, city, count(*) AS businesses FROM dfs.demo.`/yelp/business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;+------------+------------+-------------+| state | city | businesses |+------------+------------+-------------+| NV | Las Vegas | 12021 || AZ | Phoenix | 7499 || AZ | Scottsdale | 3605 || EDH | Edinburgh | 2804 || AZ | Mesa | 2041 || AZ | Tempe | 2025 || NV | Henderson | 1914 || AZ | Chandler | 1637 || WI | Madison | 1630 || AZ | Glendale | 1196 |+------------+------------+-------------+

Page 30: Drilling into Data with Apache Drill

3. EXPLORING COMPLEX DATA

Page 31: Drilling into Data with Apache Drill

business.json (1){

"business_id": "4bEjOyTaDG24SY5TxsaUNQ","full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109","hours": {

"Monday": {"close": "23:00", "open": "07:00"},"Tuesday": {"close": "23:00", "open": "07:00"},"Friday": {"close": "00:00", "open": "07:00"},"Wednesday": {"close": "23:00", "open": "07:00"},"Thursday": {"close": "23:00", "open": "07:00"},"Sunday": {"close": "23:00", "open": "07:00"},"Saturday": {"close": "00:00", "open": "07:00"}

},"open": true,"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],"city": "Las Vegas","review_count": 4084,"name": "Mon Ami Gabi","neighborhoods": ["The Strip"],"longitude": -115.172588519464,

Page 32: Drilling into Data with Apache Drill

business.json (2)"state": "NV","stars": 4.0,

"attributes": {"Alcohol": "full_bar”,

"Noise Level": "average","Has TV": false,"Attire": "casual","Ambience": {

"romantic": true,"intimate": false,"touristy": false,"hipster": false,

"classy": true,"trendy": false,

"casual": false},"Good For": {"dessert": false, "latenight": false, "lunch": false,

"dinner": true, "breakfast": false, "brunch": false},}

}

Page 33: Drilling into Data with Apache Drill

Which Places Are Open Right Now (22:00)?> SELECT name, b.hours FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00' LIMIT 2;

+------------------------------+------------------------------------------------+| name | hours |+------------------------------+------------------------------------------------+| Chang Jiang Chinese Kitchen | {"Saturday":{"close":"22:30","open":"11:00"}} || Grand China Restaurant | {"Saturday":{"close":"23:00","open":"11:00"}} |+------------------------------+------------------------------------------------+

Page 34: Drilling into Data with Apache Drill

It’s 10pm in Vegas and I Want Good Hummus!

> SELECT name, b.hours.Friday AS friday, categories FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2;

+--------------------------------+-----------------------------------+--------------------------------------------------------------+| name | friday | categories |+--------------------------------+-----------------------------------+--------------------------------------------------------------+| Olives | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] || Marrakech Moroccan Restaurant | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |+--------------------------------+-----------------------------------+--------------------------------------------------------------+

Page 35: Drilling into Data with Apache Drill

Flatten Repeated Values> SELECT name, categories FROM dfs.demo.`yelp/business.json` LIMIT 3;+-----------------------------+-------------------------------------------+| name | categories |+-----------------------------+-------------------------------------------+| Eric Goldberg, MD | ["Doctors","Health & Medical"] || Pine Cone Restaurant | ["Restaurants"] || Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |+-----------------------------+-------------------------------------------+

> SELECT name, FLATTEN(categories) AS categories FROM dfs.demo.`yelp/business.json` LIMIT 5;+-----------------------------+-------------------------+| name | categories |+-----------------------------+-------------------------+| Eric Goldberg, MD | Doctors || Eric Goldberg, MD | Health & Medical || Pine Cone Restaurant | Restaurants || Deforest Family Restaurant | American (Traditional) || Deforest Family Restaurant | Restaurants |+-----------------------------+-------------------------+

Page 36: Drilling into Data with Apache Drill

Most and Least Common Business Categories

> SELECT category, count(*) AS businesses FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.demo.`yelp/business.json`) c GROUP BY category ORDER BY businesses DESC;+-----------------------------------+-------------+| category | businesses |+-----------------------------------+-------------+| Restaurants | 14303 || Shopping | 6428 |…| Australian | 1 || Boat Dealers | 1 || Firewood | 1 |+-----------------------------------+-------------+715 rows selected (3.439 seconds)

> SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');+------+------------+| name | categories |+------+------------+| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |+------+------------+

Page 37: Drilling into Data with Apache Drill

4. LEVERAGING VIEWS

Page 38: Drilling into Data with Apache Drill

Create a View for Name-Gender Mapping

> CREATE VIEW dfs.tmp.`names` AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;> USE dfs.tmp;> CREATE VIEW names1 ASSELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;> SELECT * FROM dfs.tmp.names WHERE name = 'John';+------------+------------+| name | gender |+------------+------------+| John | Male |+------------+------------+

columns[0]

columns[4]

names.csv:

Page 39: Drilling into Data with Apache Drill

Most Common Names (and their Genders) on Yelp

> SELECT u.name, n.gender, count(*) AS number FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY u.name, n.gender ORDER BY number DESC LIMIT 10;+------------+------------+------------+| name | gender | number |+------------+------------+------------+| David | Male | 2453 || John | Male | 2378 || Michael | Male | 2322 || Chris | Unknown | 2202 || Mike | Male | 2037 || Jennifer | Female | 1867 || Jessica | Female | 1463 || Jason | Male | 1457 || Michelle | Female | 1439 || Brian | Male | 1436 |+------------+------------+------------+

Page 40: Drilling into Data with Apache Drill

Who Rates Higher – Men or Women?> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY n.gender;+------------+------------+------------+| gender | users | stars |+------------+------------+------------+| Female | 103684 | 3.77 || Male | 97430 | 3.696 || Unknown | 18409 | 3.727 |+------------+------------+------------+

Page 41: Drilling into Data with Apache Drill

Who Writes Longer Reviews – Men or Women?

> SELECT n.gender, round(avg(length(r.text))) AS review_length FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name AND r.user_id = u.user_id GROUP BY n.gender;+------------+---------------+| gender | review_length |+------------+---------------+| Male | 665 || Female | 730 || Unknown | 711 |+------------+---------------+

It takes a 3-way join to find out…

Page 42: Drilling into Data with Apache Drill

Thank You!

• Download at drill.apache.org

• Get in touch:• [email protected][email protected]

• Ask questions:• [email protected]

• Tweet: @ApacheDrill