© 2015 MapR Technologies
Evolving from RDBMS to NoSQL + SQL
Jim Scott – Director, Enterprise Strategy & Architecture
@kingmesal #strataconf
Why Does this Matter
• 90%+ of the use cases do not deal with “relational” data
• RDBMS data models are more complex than a single table
– One-to-many relationships require multiple tables
– Creating code to persist data takes time and QA
• Inferred (or removed) keys used without actual foreign keys
– Difficult for others to understand relationships
• Transactional tables never look the same as analytics tables
– OLTP -> ETL -> OLAP
– This takes significant time to build
Topics
• Changing Data Models
– Relational Model to JSON Model
• A New Database for JSON Data
– Document Database (OJAI)
• Querying JSON Data and More
– Drill
• Resources
Empowering “as it happens” businesses by speeding up the
data-to-action cycle
Changing Data Models
180 Tables NOT SHOWN!
236 tables to describe 7 kinds of things
Searching for Elvis
// Find discs where Elvis was credited
> SELECT DISTINCT album_id, name FROM
(SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
JOIN (SELECT DISTINCT artist_id FROM
(SELECT id artist_id, FLATTEN(alias) FROM artist WHERE name LIKE 'Elvis%Presley')
) artists USING (artist_id);
Benefits
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today
A New Database for JSON Data
Basics of the API
• http://ojai.github.io/
• Entry point to a table - DocumentStore
– insert()
– insertOrReplace()
– find()
– delete()
– replace()
– update()
– increment()
Working with JSON in Java
• Step 1 – Create instance of JSON Serializer
Gson gson = new Gson();
• Step 2 – Serialize POJO to JSON
String json = gson.toJson(myObject);
• Step 3 – Deserialize JSON into POJO
MyObject myObject = gson.fromJson(json, MyObject.class);
Creating Documents in Java OJAI
• Use static methods on class org.ojai.json.Json
Document doc = Json.newDocument(myObject);
Document doc = Json.newDocument(jsonString);
• Alternatively
– Use builders
– Stream from disk
– Use InputStream
Creating New Documents
• DocumentStore.insert(doc)
Done!
• DocumentStore.insertOrReplace(doc)
Done!
Easy, right?
Updating Existing Documents
• DocumentStore.update(_id, DocumentMutation)
• Mutation methods
– mutation.append(FieldPath, "user visited URL");
– mutation.set("field.name", "What a great example");
– mutation.increment("field", 1);
– mutation.merge("field", Map<String, Object>);
– mutation.setOrReplace(…);
– mutation.delete(field);
Yes, these are atomic.
Deleting Documents
• DocumentStore.delete(doc);
Done!
• DocumentStore.delete(_id);
Done!
This is easy too, right?
Finding Documents
• DocumentStore.find(QueryCondition);
• Query condition setup:
– qc.is("field", EQUAL, "blue")
.and().notExists("other.field")
.or().like("field", "%purple")
.or().matches("another.field", "regular expression")
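How a condition like this selects documents can be sketched in plain Java. This is not the OJAI API: the class ConditionSketch and its helpers is(), notExists(), endsWith() and find() are hypothetical names for illustration, documents are modeled as flat maps (no nested field paths), only the "%suffix" form of LIKE is handled, and the and/or chain is assumed to group left to right.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative only: plain-Java predicates standing in for a query condition.
// None of these names belong to the OJAI API.
final class ConditionSketch {

    // True when the top-level field equals the expected value.
    static Predicate<Map<String, Object>> is(String field, Object expected) {
        return doc -> expected.equals(doc.get(field));
    }

    // True when the document has no such top-level field.
    static Predicate<Map<String, Object>> notExists(String field) {
        return doc -> !doc.containsKey(field);
    }

    // Covers only the "%suffix" form of LIKE shown on the slide.
    static Predicate<Map<String, Object>> endsWith(String field, String suffix) {
        return doc -> doc.get(field) instanceof String
                && ((String) doc.get(field)).endsWith(suffix);
    }

    // Returns the documents that satisfy the condition, in input order.
    static List<Map<String, Object>> find(List<Map<String, Object>> docs,
                                          Predicate<Map<String, Object>> condition) {
        return docs.stream().filter(condition).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
                Map.of("_id", "a", "field", "blue"),
                Map.of("_id", "b", "field", "blue", "other", "x"),
                Map.of("_id", "c", "field", "deep purple", "other", "y"),
                Map.of("_id", "d", "field", "green"));

        // (field == "blue" AND "other" is absent) OR field ends with "purple"
        Predicate<Map<String, Object>> condition =
                is("field", "blue").and(notExists("other"))
                        .or(endsWith("field", "purple"));

        find(docs, condition).forEach(doc -> System.out.println(doc.get("_id")));
    }
}
```

Composing small predicates keeps each test readable on its own, which is the same idea the fluent condition builder expresses.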
Querying JSON Data and More
How To Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency
Agility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
Drill’s Data Model is Flexible
[Figure: data formats placed on two axes, flat to complex and fixed schema to dynamic schema, with flexibility increasing along both. Formats shown: JSON, BSON, HBase, Parquet, Avro, CSV, TSV]
RDBMS/SQL-on-Hadoop table
Name Gender Age
Michael M 6
Jennifer F 3
Apache Drill table
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach
Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
New business questions and source data evolution feed back into the cycle
Total time to insight: weeks to months
Exploratory approach
Hadoop data -> Users
Total time to insight: minutes
Evolution Towards Self-Service Data Exploration
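• Traditional BI w/ RDBMS
– Data Modeling and Transformation: IT-driven
– Data Visualization: IT-driven
• Self-Service BI w/ RDBMS
– Data Modeling and Transformation: IT-driven
– Data Visualization: Self-service
• SQL-on-Hadoop
– Data Modeling and Transformation: IT-driven
– Data Visualization: Self-service
• Self-Service Data Exploration (zero-day analytics)
– Data Modeling and Transformation: Optional
– Data Visualization: Self-service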
Common Use Cases
• Raw Data Exploration
• JSON Analytics
• DWH offload
[Figure: data sources pictured include Hive, HBase, files, directories, JSON, Parquet, text files, …]
Drill Enables 'SQL-on-Everything'
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
Reuse Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
Security Controls
Access Controls that Scale
PAM Authentication + User Impersonation
Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Figure: users (U) query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
Granular Security via Drill Views
Raw File (/raw/cards.csv)
Owner: Admins
Permission: Admins
Name City State Credit Card #
Dave San Jose CA 1374-7914-3865-4817
John Boulder CO 1374-9735-1794-9711
Data Scientist View (/views/maskedcards.csv)
Not a physical data copy
Owner: Admins
Permission: Data Scientists
Name City State Credit Card #
Dave San Jose CA 1374-1111-1111-1111
John Boulder CO 1374-1111-1111-1111
Business Analyst View
Owner: Admins
Permission: Business Analysts
Name City State
Dave San Jose CA
John Boulder CO
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv): John (Owner)
Name City State Credit Card #
Dave San Jose CA 1374-7914-3865-4817
John Boulder CO 1374-9735-1794-9711
Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
Name City State Credit Card #
Dave San Jose CA 1374-1111-1111-1111
John Boulder CO 1374-1111-1111-1111
Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
Name City State
Dave San Jose CA
John Boulder CO
Access path: Jack queries the view V_Analyst
• V_Analyst
– Does Jack have access to V_Analyst? -> YES
– Who is the owner of V_Analyst? -> Jane
– Drill accesses V_Analyst as Jane (Impersonation hop 1)
• V_Scientist
– Does Jane have access to V_Scientist? -> YES
– Who is the owner of V_Scientist? -> John
– Drill accesses V_Scientist as John (Impersonation hop 2)
• Raw file
– Does John have permissions on raw file? -> YES
– Who is the owner of raw file? -> John
– Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
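The walk along the ownership chain can be sketched in plain Java. This is a simplified model, not Drill's implementation: the class OwnershipChain, its Resource type and the resolve() method are hypothetical names, and the way the hop limit is enforced here is an assumption made for illustration.

```java
import java.util.Set;

// Illustrative model of ownership chaining; not Drill code.
final class OwnershipChain {

    // A view or file with one owner, a set of readers and an optional underlying source.
    static final class Resource {
        final String name;
        final String owner;
        final Set<String> readers;
        final Resource source; // null for a raw file

        Resource(String name, String owner, Set<String> readers, Resource source) {
            this.name = name;
            this.owner = owner;
            this.readers = readers;
            this.source = source;
        }
    }

    // Walks from the queried resource down to the raw data and returns the number
    // of impersonation hops. Throws SecurityException when access is denied or the
    // chain needs more than maxHops hops.
    static int resolve(Resource start, String user, int maxHops) {
        int hops = 0;
        String actingAs = user;
        Resource current = start;
        while (true) {
            boolean allowed = current.owner.equals(actingAs)
                    || current.readers.contains(actingAs);
            if (!allowed) {
                throw new SecurityException(actingAs + " cannot access " + current.name);
            }
            if (current.source == null) {
                return hops; // reached the raw file
            }
            if (!current.owner.equals(actingAs)) {
                hops++; // the underlying source is accessed as the owner
                if (hops > maxHops) {
                    throw new SecurityException("ownership chain longer than " + maxHops);
                }
                actingAs = current.owner;
            }
            current = current.source;
        }
    }

    public static void main(String[] args) {
        Resource raw = new Resource("/raw/cards.csv", "John", Set.of(), null);
        Resource vScientist = new Resource("V_Scientist", "John", Set.of("Jane"), raw);
        Resource vAnalyst = new Resource("V_Analyst", "Jane", Set.of("Jack"), vScientist);

        // Jack -> (as Jane) -> (as John) -> raw file
        System.out.println(resolve(vAnalyst, "Jack", 3));
    }
}
```

In this model the last step needs no hop because John already owns both V_Scientist and the raw file, matching the "no impersonation here" note on the slide.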
Security Summary
• Logical
– No physical data copies/silos
• Granular
– Row level and column level security controls
• De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
• Self-service w/ governance
– If you have access to data, you control who can access it and how widely
– Audits
Using Drill with Yelp
Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {"romantic": true, "intimate": false, "touristy": false, "hipster": false, "classy": true, "trendy": false, "casual": false},
    "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false}
  }
}
Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.0.0.tar.gz
Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;
Results
+--------+-------------+-------------+
| state  | city        | businesses  |
+--------+-------------+-------------+
| NV     | Las Vegas   | 12021       |
| AZ     | Phoenix     | 7499        |
| AZ     | Scottsdale  | 3605        |
| EDH    | Edinburgh   | 2804        |
| AZ     | Mesa        | 2041        |
| AZ     | Tempe       | 2025        |
| NV     | Henderson   | 1914        |
| AZ     | Chandler    | 1637        |
| WI     | Madison     | 1630        |
| AZ     | Glendale    | 1196        |
+--------+-------------+-------------+
Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0

sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
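How the directory levels line up with dir0 and dir1 can be sketched in plain Java. This is a minimal illustration, not Drill code: the class ImplicitPartitions, the method dirColumns() and the file name orders.csv are hypothetical, since the slide shows only the directory tree.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: derives dir0, dir1, ... from a file's position under a table root.
final class ImplicitPartitions {

    // Returns the directory names between the table root and the file, outermost
    // first: index 0 plays the role of dir0, index 1 the role of dir1.
    static List<String> dirColumns(String tableRoot, String file) {
        Path relative = Paths.get(tableRoot).relativize(Paths.get(file));
        List<String> dirs = new ArrayList<>();
        // The last name element is the file itself, so it is skipped.
        for (int i = 0; i < relative.getNameCount() - 1; i++) {
            dirs.add(relative.getName(i).toString());
        }
        return dirs;
    }

    public static void main(String[] args) {
        // orders.csv is a made-up file name for illustration
        List<String> dirs = dirColumns("sales", "sales/2014/q3/orders.csv");
        System.out.println("dir0=" + dirs.get(0) + ", dir1=" + dirs.get(1));
    }
}
```

With the tree above, a file under sales/2014/q3 would carry dir0 = 2014 and dir1 = q3, which is what lets a query filter or group on the year and quarter without any declared partitioning.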
Intuitive SQL Access to Complex Data
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00'
    AND b.hours.Friday.`close` > '22:00'
    AND REPEATED_CONTAINS(categories, 'Mediterranean')
    AND city = 'Las Vegas'
  ORDER BY stars DESC LIMIT 2;
+------------+------------+------------+------------+
| name | stars | friday | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
Query data with any levels of nesting
Reviews dataset
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
ANSI SQL Compatibility
// Get top cool rated businesses
SELECT b.name FROM dfs.yelp.`business.json` b
WHERE b.business_id IN (
  SELECT r.business_id FROM dfs.yelp.`review.json` r
  GROUP BY r.business_id
  HAVING SUM(r.votes.cool) > 2000
  ORDER BY SUM(r.votes.cool) DESC);
+--------------------------------+
| name                           |
+--------------------------------+
| Earl of Sandwich               |
| XS Nightclub                   |
| The Cosmopolitan of Las Vegas  |
| Wicked Spoon                   |
+--------------------------------+
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
Logical Views
// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------------+------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total      |
+------------+
| 1125458    |
+------------+
Lightweight file system based views for granular and de-centralized data management
Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------------+---------------------------+
| Fragment   | Number of records written |
+------------+---------------------------+
| 1_0        | 176448                    |
| 1_1        | 192439                    |
| 1_2        | 198625                    |
| 1_3        | 200863                    |
| 1_4        | 181420                    |
| 1_5        | 175663                    |
+------------+---------------------------+
Save analysis results as tables using familiar CTAS syntax
Repeated Values Support
// Flatten repeated categories
> SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------------+------------+
> SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+------------+------------+
Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
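What flattening does to each row can be sketched in plain Java. This only mirrors the row-multiplying behavior visible in the results above; it is not how Drill implements FLATTEN, and the class FlattenSketch and its flatten() method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: one input row with a repeated column becomes one output row per element.
final class FlattenSketch {

    // Input: name -> list of categories. Output: one (name, category) pair per list element.
    static List<Map.Entry<String, String>> flatten(Map<String, List<String>> rows) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            for (String category : row.getValue()) {
                out.add(Map.entry(row.getKey(), category));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the rows in insertion order
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("Eric Goldberg, MD", List.of("Doctors", "Health & Medical"));
        rows.put("Pine Cone Restaurant", List.of("Restaurants"));
        rows.put("Deforest Family Restaurant", List.of("American (Traditional)", "Restaurants"));

        for (Map.Entry<String, String> pair : flatten(rows)) {
            System.out.println(pair.getKey() + " | " + pair.getValue());
        }
    }
}
```

Three input rows become five output rows here, the same shape as the second result table above.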
Checkins dataset
{
  "checkin_info": {
    "3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "14-6": 1, "14-2": 1, "14-3": 1, "19-0": 1, "11-5": 1,
    "13-2": 1, "11-6": 2, "11-3": 1, "12-6": 1, "6-5": 1, "5-5": 1, "9-2": 1, "9-5": 1, "9-6": 1,
    "5-2": 1, "7-6": 1, "7-5": 1, "7-4": 1, "17-5": 1, "8-5": 1, "10-2": 1, "10-5": 1, "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}
Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` LIMIT 6;
+---------------------------+
| checkins                  |
+---------------------------+
| {"key":"3-4","value":1}   |
| {"key":"13-5","value":1}  |
| {"key":"6-6","value":1}   |
| {"key":"14-5","value":1}  |
| {"key":"14-6","value":1}  |
| {"key":"14-2","value":1}  |
+---------------------------+
Convert Map with a wide set of dynamic columns into an array of key-value pairs
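The map-to-pairs conversion can be sketched in plain Java. It mirrors only the shape of the KVGEN output shown above and is not Drill's implementation; the class KvgenSketch and its kvgen() method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: turns a map with arbitrary keys into a list of {key, value} records.
final class KvgenSketch {

    // Each map entry becomes a small record with fixed field names "key" and "value",
    // so the result has a predictable schema no matter which keys the input had.
    static List<Map<String, Object>> kvgen(Map<String, Integer> dynamicColumns) {
        List<Map<String, Object>> pairs = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : dynamicColumns.entrySet()) {
            Map<String, Object> pair = new LinkedHashMap<>();
            pair.put("key", entry.getKey());
            pair.put("value", entry.getValue());
            pairs.add(pair);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A few entries from the checkin_info map, in insertion order
        Map<String, Integer> checkinInfo = new LinkedHashMap<>();
        checkinInfo.put("3-4", 1);
        checkinInfo.put("13-5", 1);
        checkinInfo.put("11-6", 2);

        kvgen(checkinInfo).forEach(System.out::println);
    }
}
```

Once every entry has the same two fields, the array can be flattened and queried like any other repeated column.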
Resources
Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
OJAI and MapR-DB
Where to find it…
– The source: https://github.com/ojai/ojai
– The site: http://ojai.github.io/
– Python bindings: https://github.com/mapr-demos/python-bindings
– Javascript bindings: https://github.com/mapr-demos/js-bindings
Ready to play with your data?
– Download the sandbox: http://maprdb.io
– Examples:
• Java: https://github.com/mapr-demos/maprdb-ojai-101
• Python: https://github.com/mapr-demos/maprdb_python_examples
Drill Walkthrough
• Example queries
• Conversion from relational model to flat JSON model
https://www.mapr.com/blog/drilling-healthy-choices
https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out Apache Drill in 10 mins guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions – [email protected]
Q & A
@kingmesal maprtech
Engage with us!
MapR
maprtech
mapr-technologies