© 2015 MapR Technologies
Rethinking SQL for Big Data with Apache Drill
Jim Scott – Director, Enterprise Strategy & Architecture
Cisco Data & Analytics Conference
Topics
• Motivation
• Using Drill
• SQL + NoSQL = ???
• Security Controls
• Resources
Empowering “as it happens” businesses by speeding up the
data-to-action cycle
Motivation
Data is Doubling Every Two Years
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980-2020, structured vs. semi-structured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Data Increasingly Stored in Non-Relational Datastores
              RELATIONAL DATABASES                      NON-RELATIONAL DATASTORES
Volume        GBs-TBs                                   TBs-PBs
Structure     Structured                                Structured, semi-structured and unstructured
Database      Fixed schema; DBA controls structure      Dynamic / flexible schema; application controls structure
Development   Planned (release cycle = months-years)    Iterative (release cycle = days-weeks)
How To Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency
Agility of NoSQL
• No schema management
  – HDFS (Parquet, JSON, etc.)
  – HBase
  – …
• No transformation
  – No silos of data
• Ease of use
Industry's First Schema-free SQL engine
for Big Data
Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach: Hadoop data → Data modeling → Transformation → Data movement (optional) → Users
  New business questions; source data evolution
  Total time to insight: weeks to months
Exploratory approach: Hadoop data → Users
  Total time to insight: minutes
Evolution Towards Self-Service Data Exploration
                                   Traditional BI   Self-Service BI   SQL-on-Hadoop   Self-Service
                                   w/ RDBMS         w/ RDBMS                          Data Exploration
Data Modeling and Transformation   IT-driven        IT-driven         IT-driven       Optional
Data Visualization                 IT-driven        Self-service      Self-service    Self-service
Zero-day analytics
Common Use Cases
Raw Data Exploration, JSON Analytics, DWH offload
Hive, HBase, Files, Directories, …
{JSON}, Parquet, Text Files, …
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
Drill’s Data Model is Flexible
[Diagram: formats placed by schema (fixed to dynamic) and structure (flat to complex), with flexibility increasing along both axes: CSV/TSV, Parquet/Avro, HBase, JSON/BSON]

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
Reuse Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
Using Drill
Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true, "intimate": false, "touristy": false, "hipster": false,
      "classy": true, "trendy": false, "casual": false
    },
    "Good For": {
      "dessert": false, "latenight": false, "lunch": false,
      "dinner": true, "breakfast": false, "brunch": false
    }
  }
}
Reviews dataset
{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}
Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.0.0.tar.gz
Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC
  LIMIT 10;
Results
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+
Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0
sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
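To make the dir0/dir1 idea concrete, here is a minimal Python sketch of how directory levels below a table root can act as partition columns. It is an illustration of the concept only, not Drill's implementation; the file names and amounts are invented for the example, and the helper names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def implicit_partitions(table_root, file_path):
    """Map each directory level below table_root to dir0, dir1, ..."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    # Every path component except the file name is a directory level.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}


# Hypothetical files and amounts, invented for illustration only.
FILES = {
    "sales/2014/q1/a.csv": 10,
    "sales/2014/q2/b.csv": 20,
    "sales/2014/q3/c.csv": 40,
    "sales/2015/q1/d.csv": 5,
}


def sum_by_year(files, quarters):
    """Emulate: SELECT dir0, SUM(amount) ... WHERE dir1 IN (...) GROUP BY dir0."""
    totals = defaultdict(int)
    for path, amount in files.items():
        dirs = implicit_partitions("sales", path)
        if dirs["dir1"] in quarters:          # WHERE dir1 IN (...)
            totals[dirs["dir0"]] += amount    # GROUP BY dir0
    return dict(totals)


print(sum_by_year(FILES, {"q1", "q2"}))
```

Only files under the q1 and q2 directories contribute, which is the pruning effect the slide's query relies on.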
Intuitive SQL Access to Complex Data
Query data with any levels of nesting
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00'
    AND b.hours.Friday.`close` > '22:00'
    AND REPEATED_CONTAINS(categories, 'Mediterranean')
    AND city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| name                          | stars | friday                           | categories                                                  |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| Olives                        | 4.0   | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"]                             |
| Marrakech Moroccan Restaurant | 4.0   | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
ANSI SQL Compatibility
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
// Get top cool rated businesses
SELECT b.name
FROM dfs.yelp.`business.json` b
WHERE b.business_id IN (
  SELECT r.business_id
  FROM dfs.yelp.`review.json` r
  GROUP BY r.business_id
  HAVING SUM(r.votes.cool) > 2000
  ORDER BY SUM(r.votes.cool) DESC);
+-------------------------------+
| name                          |
+-------------------------------+
| Earl of Sandwich              |
| XS Nightclub                  |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon                  |
+-------------------------------+
Logical Views
Lightweight file system based views for granular and de-centralized data management
// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------+-----------------------------------------------------------------+
| ok   | summary                                                         |
+------+-----------------------------------------------------------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------+-----------------------------------------------------------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+---------+
| Total   |
+---------+
| 1125458 |
+---------+
Materialized Views AKA Tables
Save analysis results as tables using familiar CTAS syntax
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 176448                    |
| 1_1      | 192439                    |
| 1_2      | 198625                    |
| 1_3      | 200863                    |
| 1_4      | 181420                    |
| 1_5      | 175663                    |
+----------+---------------------------+
Repeated Values Support
Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
// Flatten repeated categories
> SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3;
+----------------------------+------------------------------------------+
| name                       | categories                               |
+----------------------------+------------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
| Pine Cone Restaurant       | ["Restaurants"]                          |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5;
+----------------------------+------------------------+
| name                       | categories             |
+----------------------------+------------------------+
| Eric Goldberg, MD          | Doctors                |
| Eric Goldberg, MD          | Health & Medical       |
| Pine Cone Restaurant       | Restaurants            |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants            |
+----------------------------+------------------------+
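The row expansion shown by FLATTEN can be sketched in a few lines of Python. This is only a model of the behaviour visible in the results above (one output row per array element, other columns repeated), not Drill's implementation; the function name is hypothetical.

```python
def flatten(rows, column):
    """Yield one output row per element of the repeated `column`."""
    for row in rows:
        for element in row[column]:
            # Copy the other columns and replace the list with a single element.
            yield {**row, column: element}


rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
]

for out in flatten(rows, "categories"):
    print(out["name"], "|", out["categories"])
```

Two input rows become three output rows, matching the first three rows of the flattened result on the slide.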
Extensions to ANSI SQL to work with repeated values
// Get most common business categories
> SELECT category, count(*) AS categorycount
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.yelp.`business.json`) c
  GROUP BY category
  ORDER BY categorycount DESC;
+--------------+---------------+
| category     | categorycount |
+--------------+---------------+
| Restaurants  | 14303         |
| …            | …             |
| Australian   | 1             |
| Boat Dealers | 1             |
| Firewood     | 1             |
+--------------+---------------+
Checkins dataset
{
  "checkin_info": {
    "3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "14-6": 1, "14-2": 1, "14-3": 1,
    "19-0": 1, "11-5": 1, "13-2": 1, "11-6": 2, "11-3": 1, "12-6": 1, "6-5": 1,
    "5-5": 1, "9-2": 1, "9-5": 1, "9-6": 1, "5-2": 1, "7-6": 1, "7-5": 1,
    "7-4": 1, "17-5": 1, "8-5": 1, "10-2": 1, "10-5": 1, "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}
Supports Dynamic / Unknown Columns
Convert Map with a wide set of dynamic columns into an array of key-value pairs
> SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins   |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` LIMIT 6;
+--------------------------+
| checkins                 |
+--------------------------+
| {"key":"3-4","value":1}  |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1}  |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
+--------------------------+
Makes it easy to work with dynamic/unknown columns
// Count total number of checkins on Sunday midnight
> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins
  FROM (SELECT FLATTEN(KVGEN(checkin_info)) checkins
        FROM dfs.yelp.`checkin.json`) checkintbl
  WHERE checkintbl.checkins.key = '23-0';
+------------------------+
| SundayMidnightCheckins |
+------------------------+
| 8575                   |
+------------------------+
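The KVGEN, FLATTEN and SUM combination can be sketched in Python as follows. It models the shape of the transformation only (map to key-value pairs, one pair per row, then filter and sum); it is not Drill's implementation, the helper names are hypothetical, and the sample records are invented.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of key/value pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def total_checkins(records, slot):
    """Sum the values whose key equals `slot` across all records."""
    total = 0
    for record in records:
        for pair in kvgen(record["checkin_info"]):  # FLATTEN(KVGEN(checkin_info))
            if pair["key"] == slot:                 # WHERE checkins.key = slot
                total += pair["value"]              # SUM(checkins.`value`)
    return total


# Hypothetical records, invented for illustration only.
records = [
    {"checkin_info": {"3-4": 1, "23-0": 2}},
    {"checkin_info": {"23-0": 5, "11-6": 2}},
]

print(total_checkins(records, "23-0"))
```

Because the keys become data rather than column names, the query does not need to know in advance which time slots exist in any given record.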
SQL + NoSQL = Accessible & Linearly Scalable
180 Tables NOT SHOWN!
MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
  – Add works (compositions), recordings, release, release group
    • 7 tables for artist alone
    • 12 for place, 7 for label, 17 for release/group, 8 for work
  – (but only 4 for recording!)
  – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
  – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
  – And 50 more tables that aren’t documented yet
236 tables to describe 7 kinds of things
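The 236 figure follows from the counts quoted on the previous slide. The short Python check below simply re-adds them; the grouping into "core", "supporting" and "undocumented" is a labelling chosen here for readability, and the core group uses exactly the five addends that appear in the slide's own sum.

```python
# Table counts as quoted on the MusicBrainz slide.
core = {"place": 12, "label": 7, "release/group": 17, "work": 8, "recording": 4}
supporting = {
    "annotation": 10, "edit": 10, "tag": 19, "rating": 5,
    "link": 86, "cover art": 5, "CD timing": 3,
}
undocumented = 50

core_total = sum(core.values())
supporting_total = sum(supporting.values())
grand_total = core_total + supporting_total + undocumented

print(core_total, supporting_total, grand_total)
```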
Further Reductions
• All 86 link tables become properties on artists, releases and other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
Is This Good?
• Expressivity
  – The JSON data model is at least as expressive as the original relational model
    • Many cases easier to describe in nested data
    • No cases are harder
• Efficiency
  – Inlining can increase data size. Locality improves, however
  – Sessionizing can substantially decrease data size
  – Inlining back-references is more efficient than ordinary indexes
  – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)
Searching for Elvis
// Find discs where Elvis was credited
> SELECT distinct album_id, name
  FROM (SELECT id album_id, artist_id, name, FLATTEN(credit)
        FROM release) albums
  join (SELECT distinct artist_id
        FROM (SELECT id artist_id, FLATTEN(alias)
              FROM artist
              where name like 'Elvis%Presley')
       ) artists
  USING (artist_id);
Benefits
• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today
Security Controls
Access Controls that Scale
PAM Authentication + User Impersonation
Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Diagram: users (U) query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
Granular Security via Drill Views
Raw File (/raw/cards.csv)
Owner: Admins; Permission: Admins
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Owner: Admins; Permission: Data Scientists
Not a physical data copy
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins; Permission: Business Analysts
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
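The two views above amount to a column mask and a column projection computed at read time. The Python sketch below illustrates that idea with the slide's sample rows; it is not Drill view syntax, the function names are hypothetical, and the masking rule (keep the first digit group, overwrite the rest with 1111) is inferred from the masked values shown.

```python
RAW = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card):
    """Keep the first group of digits and overwrite the others with 1111."""
    first, *rest = card.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view(rows):
    """All columns, with the card number masked; computed on read, nothing is stored."""
    return [{**row, "card": mask_card(row["card"])} for row in rows]


def business_analyst_view(rows):
    """Only the non-sensitive columns."""
    return [{k: row[k] for k in ("name", "city", "state")} for row in rows]


print(data_scientist_view(RAW)[0]["card"])
print(business_analyst_view(RAW)[0])
```

Neither function writes anything back, which mirrors the slide's point that a view is not a physical data copy.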
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist): Jane (Read), John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst): Jack (Read), Jane (Owner)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO

[Diagram: V_Analyst → V_Scientist → RAW FILE]
Jack queries the view V_Analyst

Does Jack have access to V_Analyst? -> YES
Who is the owner of V_Analyst? -> Jane
Drill accesses V_Analyst as Jane (Impersonation hop 1)

Does Jane have access to V_Scientist? -> YES
Who is the owner of V_Scientist? -> John
Drill accesses V_Scientist as John (Impersonation hop 2)

Does John have permissions on raw file? -> YES
Who is the owner of raw file? -> John (Owner)
Drill accesses source file as John (no impersonation here)

*Ownership chain length (# hops) is configurable
Legend: ownership chaining, access path
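The walk-through above can be modelled as a loop that checks read access, then switches identity to the owner of each view before following it to its source. The Python sketch below is a toy model under stated assumptions: the object table, the `max_hops` parameter and the function name are all hypothetical, and nothing here reflects Drill's actual configuration keys or code.

```python
# Hypothetical model of the slide's objects: owner, readers, and underlying source.
OBJECTS = {
    "V_Analyst":   {"owner": "Jane", "readers": {"Jack", "Jane"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane", "John"}, "source": "raw"},
    "raw":         {"owner": "John", "readers": {"John"}, "source": None},
}


def resolve(user, name, max_hops=3):
    """Follow the ownership chain; return (access trace, impersonation hops)."""
    trace, hops = [], 0
    while name is not None:
        obj = OBJECTS[name]
        if user not in obj["readers"]:
            raise PermissionError(f"{user} cannot read {name}")
        trace.append((name, user))
        # A view is expanded as its owner; that identity switch is one hop.
        if obj["source"] is not None and obj["owner"] != user:
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain exceeds configured length")
            user = obj["owner"]
        name = obj["source"]
    return trace, hops


print(resolve("Jack", "V_Analyst"))
```

With the limit lowered to one hop, the same query is rejected at the second identity switch, which is the effect of making the chain length configurable.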
Security Summary
Logical
– No physical data copies/silos
Granular
– Row level and column level security controls
De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
Self-service w/ governance
– If you have access to data, you control who can access it and how widely
– Audits
Wrapping it up
Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out Apache Drill in 10 mins guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions
– [email protected]
Q & A
@kingmesal maprtech
Engage with us!
MapR
maprtech
mapr-technologies