49
© 2015 MapR Technologies 1 © 2015 MapR Technologies Rethinking SQL for Big Data with Apache Drill Jim Scott – Director, Enterprise Strategy & Architecture Cisco Data & Analytics Conference

Rethinking SQL for Big Data with Apache Drill

Embed Size (px)

Citation preview

Page 1: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 1© 2015 MapR Technologies

Rethinking SQL for Big Data with Apache Drill

Jim Scott – Director, Enterprise Strategy & ArchitectureCisco Data & Analytics Conference

Page 2: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 2

Topics

• Motivation• Using Drill• SQL + NoSQL = ???• Security Controls• Resources

Page 3: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 3

Empowering “as it happens” businesses by speeding up the

data-to-action cycle

Page 4: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 4© 2015 MapR Technologies

Motivation

Page 5: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 5

SEMI-STRUCTURED DATA

STRUCTURED DATA

1980 2000 20101990 2020

Data is Doubling Every Two Years

Unstructured data will account for more than 80% of the data

collected by organizations

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Total Data S

tored

Page 6: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 6

1980 2000 20101990 2020

Fixed schemaDBA controls structure

Dynamic / Flexible schemaApplication controls structure

NON-RELATIONAL DATASTORESRELATIONAL DATABASES

GBs-TBs TBs-PBsVolume

Database

Data Increasingly Stored in Non-Relational Datastores

Structure

Development

Structured Structured, semi-structured and unstructured

Planned (release cycle = months-years) Iterative (release cycle = days-weeks)

Page 7: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 7

How To Bring SQL to Non-Relational Data Stores?

Familiarity of SQL Agility of NoSQL

• ANSI SQL semantics

• BI (Tableau, MicroStrategy,

etc.)

• Low latency

• No schema management– HDFS (Parquet, JSON, etc.)– HBase– …

• No transformation– No silos of data

• Ease of use

Page 8: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 8

Industry's First Schema-free SQL engine

for Big Data

Page 9: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 9

Enabling “As-It-Happens” Business with Instant Analytics

Hadoop data Data modeling TransformationData

movement(optional)

Users

Hadoop data Users

Traditionalapproach

Exploratory approach

New Business questionsSource data evolution

Total time to insight: weeks to months

Total time to insight: minutes

Page 10: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 10

Evolution Towards Self-Service Data Exploration

Data Modeling and Transformation

Data Visualization

IT-driven

IT-driven

IT-driven

Self-service

IT-driven

Self-service

Optional

Self-service

Traditional BIw/ RDBMS

Self-Service BIw/ RDBMS SQL-on-Hadoop

Self-Service Data Exploration

Zero-day analytics

Page 11: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 11

Common Use Cases

Raw Data Exploration JSON Analytics DWH offload

Hive HBaseFiles Directories…

{JSON}, ParquetText Files …

Page 12: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 12

Drill Supports Schema Discovery On-The-Fly

• Fixed schema• Leverage schema in centralized

repository (Hive Metastore)

• Fixed schema, evolving schema or schema-less

• Leverage schema in centralized repository or self-describing data

2Schema Discovered On-The-FlySchema Declared In Advance

SCHEMA ON WRITE

SCHEMA BEFORE READ

SCHEMA ON THE FLY

Page 13: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 13

- Sub-directory- HBase namespace- Hive database

Drill Enables ‘SQL-on-Everything’

SELECT * FROM dfs.yelp.`business.json`

Workspace- Pathnames- Hive table- HBase table

Table

- DFS (Text, Parquet, JSON)- HBase/MapR-DB- Hive Metastore/HCatalog- Easy API to go beyond Hadoop

Storage plugin instance

Page 14: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 14

Drill’s Data Model is Flexible

JSONBSON

HBase

ParquetAvro

CSVTSV

Dynamic schemaFixed schema

Complex

Flat

Flexibility

Name Gender AgeMichael M 6Jennifer F 3

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos}{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC}

RDBMS/SQL-on-Hadoop table

Apache Drill table

Flex

ibili

ty

Page 15: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 15

Reuse Existing SQL Tools and Skills

Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support

Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Page 16: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 16© 2015 MapR Technologies

Using Drill

Page 17: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 17

Business dataset {"business_id": "4bEjOyTaDG24SY5TxsaUNQ","full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109","hours": {

"Monday": {"close": "23:00", "open": "07:00"},"Tuesday": {"close": "23:00", "open": "07:00"},"Friday": {"close": "00:00", "open": "07:00"},"Wednesday": {"close": "23:00", "open": "07:00"},"Thursday": {"close": "23:00", "open": "07:00"},"Sunday": {"close": "23:00", "open": "07:00"},"Saturday": {"close": "00:00", "open": "07:00"}

},"open": true,"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],"city": "Las Vegas","review_count": 4084,"name": "Mon Ami Gabi","neighborhoods": ["The Strip"],"longitude": -115.172588519464,"state": "NV","stars": 4.0,

"attributes": {"Alcohol": "full_bar”,

"Noise Level": "average","Has TV": false,"Attire": "casual","Ambience": {

"romantic": true,"intimate": false,"touristy": false,"hipster": false,

"classy": true,"trendy": false,

"casual": false},"Good For": {"dessert": false, "latenight": false, "lunch": false,

"dinner": true, "breakfast": false, "brunch": false},}

}

Page 18: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 18

Reviews dataset

{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}

Page 19: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 19

Zero to Results in 2 minutes$ tar -xvzf apache-drill-1.0.0.tar.gz

$ bin/sqlline -u jdbc:drill:zk=local$ bin/drill-embedded> SELECT state, city, count(*) AS businesses FROM dfs.yelp.`business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;+------------+------------+-------------+| state | city | businesses |+------------+------------+-------------+| NV | Las Vegas | 12021 || AZ | Phoenix | 7499 || AZ | Scottsdale | 3605 || EDH | Edinburgh | 2804 || AZ | Mesa | 2041 || AZ | Tempe | 2025 || NV | Henderson | 1914 || AZ | Chandler | 1637 || WI | Madison | 1630 || AZ | Glendale | 1196 |+------------+------------+-------------+

Install

Query files and

directories

Results

Launch shell (embedded mode)

Page 20: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 20

Directories are implicit partitions

SELECT dir0, SUM(amount)FROM salesGROUP BY dir1 IN (q1, q2)

sales├── 2014│   ├── q1│   ├── q2│   ├── q3│   └── q4└── 2015 └── q1

Page 21: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 21

Intuitive SQL Access to Complex Data// It’s Friday 10pm in Vegas and looking for Hummus

> SELECT name, stars, b.hours.Friday friday, categories FROM dfs.yelp.`business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2;

+------------+------------+------------+------------+| name | stars | friday | categories |+------------+------------+------------+------------+| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] || Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |+------------+------------+------------+------------+

Query data with any levels of nesting

Page 22: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 22

ANSI SQL Compatibility//Get top cool rated businesses

SELECT b.name from dfs.yelp.`business.json` b WHERE b.business_id IN (SELECT r.business_id FROM dfs.yelp.`review.json` r GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000 ORDER BY SUM(r.votes.cool) DESC);

+------------+| name |+------------+| Earl of Sandwich || XS Nightclub || The Cosmopolitan of Las Vegas || Wicked Spoon |+------------+

Use familiar SQL functionality

(Joins, Aggregations, Sorting, Sub-queries, SQL data types)

Page 23: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 23

Logical Views //Create a view combining business and reviews datasets

> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id;+------------+------------+| ok | summary |+------------+------------+| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |+------------+------------+

> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;+------------+| Total |+------------+| 1125458 |+------------+

Lightweight file system based views for granular

and de-centralized

data management

Page 24: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 24

Materialized Views AKA Tables> ALTER SESSION SET `store.format` = 'parquet';

> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id;+------------+---------------------------+| Fragment | Number of records written |+------------+---------------------------+| 1_0 | 176448 || 1_1 | 192439 || 1_2 | 198625 || 1_3 | 200863 || 1_4 | 181420 || 1_5 | 175663 |+------------+---------------------------+

Save analysis results as tables using familiar CTAS

syntax

Page 25: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 25

Repeated Values Support// Flatten repeated categories

> SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3;

+------------+------------+| name | categories |+------------+------------+| Eric Goldberg, MD | ["Doctors","Health & Medical"] || Pine Cone Restaurant | ["Restaurants"] || Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |+------------+------------+

> SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5;+------------+------------+| name | categories |+------------+------------+| Eric Goldberg, MD | Doctors || Eric Goldberg, MD | Health & Medical || Pine Cone Restaurant | Restaurants || Deforest Family Restaurant | American (Traditional) || Deforest Family Restaurant | Restaurants |+------------+------------+

Dynamically flatten

repeated and nested data elements as part of SQL queries. No ETL necessary

Page 26: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 26

Extensions to ANSI SQL to work with repeated values// Get most common business categories >SELECT category, count(*) AS categorycount FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.yelp.`business.json`) c GROUP BY category ORDER BY categorycount DESC;

+------------+------------+| category | categorycount|+------------+------------+| Restaurants | 14303 |…| Australian | 1 || Boat Dealers | 1 || Firewood | 1 |+------------+------------+

Page 27: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 27

Checkins dataset {    "checkin_info":{       "3-4":1,      "13-5":1,      "6-6":1,      "14-5":1,      "14-6":1,      "14-2":1,      "14-3":1,      "19-0":1,      "11-5":1,      "13-2":1,      "11-6":2,      "11-3":1,      "12-6":1,      "6-5":1,      "5-5":1,      "9-2":1,      "9-5":1,      "9-6":1,      "5-2":1,      "7-6":1,      "7-5":1,      "7-4":1,      "17-5":1,      "8-5":1,      "10-2":1,      "10-5":1,      "10-6":1   },   "type":"checkin",   "business_id":"JwUE5GmEO-sH1FuwJgKBlQ"}

Page 28: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 28

Supports Dynamic / Unknown Columns> SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1;+------------+| checkins |+------------+| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |+------------+

> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` limit 6;+------------+| checkins |+------------+| {"key":"3-4","value":1} || {"key":"13-5","value":1} || {"key":"6-6","value":1} || {"key":"14-5","value":1} || {"key":"14-6","value":1} || {"key":"14-2","value":1} |+------------+

Convert Map with a wide set of dynamic columns into an array of key-value pairs

Page 29: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 29

Makes it easy to work with dynamic/unknown columns// Count total number of checkins on Sunday midnight

> SELECT SUM(checkintbl.checkins.`value`) as SundayMidnightCheckins FROM (SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.checkin.json`) checkintbl WHERE checkintbl.checkins.key='23-0';

+------------------------+| SundayMidnightCheckins |+------------------------+| 8575 |+------------------------+

Page 30: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 30© 2015 MapR Technologies

SQL + NoSQL = Accessible & Linearly Scalable

Page 31: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 31

180 Tables NOT SHOWN!

Page 32: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 32

MusicBrainz on NoSQL• Artists, albums, tracks and labels are key objects• Reality check:

– Add works (compositions), recordings, release, release group• 7 tables for artist alone• 12 for place, 7 for label, 17 for release/group, 8 for work

– (but only 4 for recording!)– Total of 12 + 7 + 17 + 8 + 4 = 48 tables

• But wait, there’s more!– 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link

tables, 5 cover art tables and 3 tables for CD timing info (138 total)– And 50 more tables that aren’t documented yet

Page 33: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 33

236 tablesto describe 7 kinds of things

Page 34: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 34

Page 35: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 35

Page 36: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 36

Further Reductions• All 86 link tables become properties on artists, releases and

other entities• All 44 tag, rating and annotation tables become list properties• All 5 cover art tables become lists of file references

• Current score: 162 tables become 4

• You get the idea

Page 37: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 37

Is This Good?• Expressivity

– The JSON data model is at least as expressive as the original relational model

• Many cases easier to describe in nested data• No cases are harder

• Efficiency– Inlining can increase data size. Locality improves, however– Sessionizing can substantially decrease data size– Inlining back-references is more efficient than ordinary indexes– Inlined columnar data allows 1000x speedup for time series

• Introspection (you decide)

Page 38: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 38

Searching for Elvis// Find discs where Elvis was credited > SELECT distinct album_id, name FROM

(SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums

join (SELECT distinct artist_id FROM

(SELECT id artist_id, FLATTEN(alias) FROM artistwhere name like 'Elvis%Presley’)

) artists USING artist_id;

Page 39: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 39

Benefits• Extended relational model allows massive simplification

– On a real example, we see >20x reduction in number of tables

• Simplification drives improved introspection– This is good

• Apache Drill gives very high performance execution for extended relational problems

• You can try this out today

Page 40: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 40© 2015 MapR Technologies

Security Controls

Page 41: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 41

Access Controls that Scale

PAM Authentication + User Impersonation

Fine-grained row and column level access control with Drill Views – no centralized security repository required

Files HBase Hive

Drill View 1

Drill View 2

UUU

U

U

Page 42: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 42

Granular Security via Drill Views

Name City State Credit Card #Dave San Jose CA 1374-7914-3865-4817

John Boulder CO 1374-9735-1794-9711

Raw File (/raw/cards.csv)OwnerAdmins

Permission Admins

Business Analyst Data Scientist

Name City State Credit Card #

Dave San Jose

CA 1374-1111-1111-1111

John Boulder CO 1374-1111-1111-1111

Data Scientist View (/views/maskedcards.csv)

Not a physical data copy

Name City State

Dave San Jose

CA

John Boulder CO

Business Analyst View

OwnerAdmins

Permission Business Analysts

OwnerAdmins

Permission Data

Scientists

Page 43: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 43

Ownership ChainingCombine Self Service Exploration with Data Governance

Name City State Credit Card #

Dave San Jose CA 1374-7914-3865-4817

John Boulder CO 1374-9735-1794-9711

Raw File (/raw/cards.csv)

Name City State Credit Card #

Dave San Jose CA 1374-1111-1111-1111

John Boulder CO 1374-1111-1111-1111

Data Scientist (/views/V_Scientist)

Jane (Read)John (Owner)

Name City State

Dave San Jose CA

John Boulder CO

Analyst(/views/V_Analyst)

Jack (Read)Jane(Owner)

RAW

FILEV

_Scientist

V_A

nalyst

Does Jack have access to V_Analyst? ->YES

Who is the owner of V_Analyst? ->Jane

Drill accesses V_Analyst as Jane (Impersonation hop 1)

Does Jane have access to V_Scientist ? -> YES

Who is the owner of V_Scientist? ->John

Drill accesses V_Scientist as John (Impersonation hop 2)

John(Owner)

Does John have permissions on raw file? -> YES

Who is the owner of raw file? ->John

Drill accesses source file as John (no impersonation here)

Jack queries the view V_Analyst

*Ownership chain length (# hops) is configurable

Ownership chaining

Access path

Page 44: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 44

Security SummaryLogical

– No physical data copies/silos

Granular– Row level and column level security controls

De-centralized– User impersonation respecting storage system permissions

– No separate permission repository for granular controls

– Integrated with Hadoop File System permissions and LDAP

Self-service w/ governance– If you have access to data, you control who and how widely can access it

– Audits

Page 45: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 45© 2015 MapR Technologies

Wrapping it up

Page 46: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 46

Drill is Top-Ranked SQL-on-Hadoop

Source: Gigaom Research, 2015

Key: • Number indicates companies relative strength across all vectors• Size of ball indicates company’s relative strength along individual vector

“Drill isn’t just about

SQL-on-Hadoop.

It’s about SQL-on-

pretty-much-

anything,

immediately, and

without formality.”

Page 47: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 47

Page 49: Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies 49

Q & A@kingmesal maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies