
Evolving from RDBMS to NoSQL + SQL


Page 1: Evolving from RDBMS to NoSQL + SQL

© 2016 MapR Technologies

Evolving from RDBMS to NoSQL + SQL
Jim Scott
@kingmesal #strataconf

Page 2: Evolving from RDBMS to NoSQL + SQL


Why Does This Matter?

• 90%+ of the use cases do not deal with “relational” data
• RDBMS data models are more complex than a single table
– One-to-many relationships require multiple tables
– Creating code to persist data takes time and QA
• Inferred (or removed) keys used without actual foreign keys
– Difficult for others to understand relationships
• Transactional tables never look the same as analytics tables
– OLTP -> ETL -> OLAP
– This takes significant time to build
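To make the one-to-many point concrete, here is a minimal Python sketch using only the standard library. The `artist` and `alias` tables, their columns, and the sample values are hypothetical, chosen for illustration: the relational form needs two tables plus code to stitch them together, while the document form carries the aliases inside one record.

```python
import json
import sqlite3

# Hypothetical schema: one artist has many aliases (one-to-many).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE alias (artist_id INTEGER REFERENCES artist(id), alias TEXT);
    INSERT INTO artist VALUES (1, 'Elvis Presley');
    INSERT INTO alias VALUES (1, 'The King'), (1, 'Elvis');
""")


def artist_document(conn, artist_id):
    """Fold the parent row and its child rows into one nested document."""
    (name,) = conn.execute(
        "SELECT name FROM artist WHERE id = ?", (artist_id,)
    ).fetchone()
    aliases = [
        row[0]
        for row in conn.execute(
            "SELECT alias FROM alias WHERE artist_id = ? ORDER BY alias",
            (artist_id,),
        )
    ]
    return {"id": artist_id, "name": name, "alias": aliases}


doc = artist_document(conn, 1)
print(json.dumps(doc))
```

Once the data is stored in the nested form, the persistence code above is no longer needed: the document is the record.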

Page 3: Evolving from RDBMS to NoSQL + SQL


Topics

• Changing Data Models
– Relational Model to JSON Model
• A New Database for JSON Data
– Document Database (OJAI)
• Querying JSON Data and More
– Drill
• Resources

Page 4: Evolving from RDBMS to NoSQL + SQL


Empowering “as it happens” businesses by speeding up the data-to-action cycle

Page 5: Evolving from RDBMS to NoSQL + SQL


Changing Data Models

Page 6: Evolving from RDBMS to NoSQL + SQL


180 Tables NOT SHOWN!

Page 7: Evolving from RDBMS to NoSQL + SQL


236 tables to describe 7 kinds of things

Page 8: Evolving from RDBMS to NoSQL + SQL


Page 9: Evolving from RDBMS to NoSQL + SQL


Page 10: Evolving from RDBMS to NoSQL + SQL


Searching for Elvis

// Find discs where Elvis was credited
> SELECT DISTINCT album_id, name FROM
    (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
  JOIN
    (SELECT DISTINCT artist_id FROM
      (SELECT id artist_id, FLATTEN(alias) FROM artist WHERE name LIKE 'Elvis%Presley')
    ) artists
  USING (artist_id);

Page 11: Evolving from RDBMS to NoSQL + SQL


Benefits

• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today

Page 12: Evolving from RDBMS to NoSQL + SQL


A New Database for JSON Data

Page 13: Evolving from RDBMS to NoSQL + SQL


Basics of the API

• http://ojai.github.io/
• Entry point to a table - DocumentStore
– insert()
– insertOrReplace()
– find()
– delete()
– replace()
– update()
– increment()

Page 14: Evolving from RDBMS to NoSQL + SQL


Working with JSON in Java

• Step 1 – Create instance of JSON Serializer
    Gson gson = new Gson();
• Step 2 – Serialize POJO to JSON
    String json = gson.toJson(myObject);
• Step 3 – Deserialize JSON into POJO
    MyObject myObject = gson.fromJson(json, MyObject.class);

Page 15: Evolving from RDBMS to NoSQL + SQL


Creating Documents in Java OJAI

• Use static methods on class org.ojai.json.Json
    Document doc = Json.newDocument(myObject);
    Document doc = Json.newDocument(jsonString);
• Alternatively
– Use builders
– Stream from disk
– Use InputStream

Page 16: Evolving from RDBMS to NoSQL + SQL


Creating New Documents

• DocumentStore.insert(doc)
Done!

• DocumentStore.insertOrReplace(doc)
Done!

Easy, right?

Page 17: Evolving from RDBMS to NoSQL + SQL


Updating Existing Documents

• DocumentStore.update(_id, DocumentMutation)
• Mutation methods
– mutation.append(FieldPath, "user visited URL");
– mutation.set("field.name", "What a great example");
– mutation.increment("field", 1);
– mutation.merge("field", Map<String, Object>);
– mutation.setOrReplace(…);
– mutation.delete(field);

Yes, these are atomic.
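To show what a mutation by field path does to a document, here is a plain-Python model of four of the operations listed above. It is a sketch, not the OJAI API: the helper names are hypothetical, dotted paths are the only path syntax handled, and atomicity (which the store provides) is not modelled.

```python
def _parent(doc, path, create=False):
    """Walk a dotted field path and return (parent_dict, last_key)."""
    *parents, last = path.split(".")
    node = doc
    for key in parents:
        # Either create missing intermediate maps or require them to exist.
        node = node.setdefault(key, {}) if create else node[key]
    return node, last


def set_field(doc, path, value):
    node, key = _parent(doc, path, create=True)
    node[key] = value


def increment(doc, path, amount):
    node, key = _parent(doc, path, create=True)
    node[key] = node.get(key, 0) + amount


def append(doc, path, value):
    node, key = _parent(doc, path, create=True)
    node.setdefault(key, []).append(value)


def delete(doc, path):
    node, key = _parent(doc, path)
    node.pop(key, None)


doc = {"_id": "user1", "visits": {"count": 1}}
set_field(doc, "profile.note", "What a great example")
increment(doc, "visits.count", 1)
append(doc, "visits.urls", "/home")
delete(doc, "profile.note")
print(doc)
```

The point of the mutation API is that the client sends only these small changes, so the whole document never has to be read, modified and written back.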

Page 18: Evolving from RDBMS to NoSQL + SQL


Deleting Documents

• DocumentStore.delete(doc);

Done!

• DocumentStore.delete(_id);

Done!

This is easy too, right?

Page 19: Evolving from RDBMS to NoSQL + SQL


Finding Documents

• DocumentStore.find(QueryCondition);
• Query condition setup:
– qc.is("field", EQUAL, "blue")
      .and().notExists("other.field")
      .or().like("field", "%purple")
      .or().matches("another.field", "regular expression")
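As a rough illustration of what such a condition tests, here is a plain-Python model of three of the predicates (equality, non-existence, and a LIKE pattern) evaluated against nested documents. The function names and sample documents are hypothetical, the LIKE rules (`%` for any run, `_` for one character) are the usual SQL ones and are assumed here, and the way the real API groups chained `and()` / `or()` calls is not modelled; the grouping below is written out explicitly.

```python
import re

_MISSING = object()


def get_path(doc, path):
    """Return the value at a dotted field path, or _MISSING if absent."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def is_equal(doc, path, value):
    return get_path(doc, path) == value


def not_exists(doc, path):
    return get_path(doc, path) is _MISSING


def like(doc, path, pattern):
    """SQL-style LIKE: % matches any run of characters, _ matches one."""
    value = get_path(doc, path)
    if not isinstance(value, str):
        return False
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


docs = [
    {"_id": 1, "field": "blue"},
    {"_id": 2, "field": "blue", "other": {"field": 1}},
    {"_id": 3, "field": "deep purple"},
]
hits = [
    d["_id"]
    for d in docs
    if (is_equal(d, "field", "blue") and not_exists(d, "other.field"))
    or like(d, "field", "%purple")
]
print(hits)
```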

Page 20: Evolving from RDBMS to NoSQL + SQL


Querying JSON Data and More

Page 21: Evolving from RDBMS to NoSQL + SQL


How to Bring SQL to Non-Relational Data Stores?

Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency

Agility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use

Page 22: Evolving from RDBMS to NoSQL + SQL


Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

Page 23: Evolving from RDBMS to NoSQL + SQL


Drill’s Data Model is Flexible

[Figure: data formats placed on two axes, flat to complex and fixed schema to dynamic schema, with an arrow labelled "Flexibility". Formats shown: JSON, BSON, HBase, Parquet, Avro, CSV, TSV.]

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Page 24: Evolving from RDBMS to NoSQL + SQL


Enabling “As-It-Happens” Business with Instant Analytics

Traditional approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
– Total time to insight: weeks to months

Exploratory approach: Hadoop data -> Users
– Total time to insight: minutes

New business questions, source data evolution

Page 25: Evolving from RDBMS to NoSQL + SQL


Evolution Towards Self-Service Data Exploration

Approach                        Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS         IT-driven                          IT-driven
Self-Service BI w/ RDBMS        IT-driven                          Self-service
SQL-on-Hadoop                   IT-driven                          Self-service
Self-Service Data Exploration   Optional                           Self-service

Zero-day analytics

Page 26: Evolving from RDBMS to NoSQL + SQL


Common Use Cases

• Raw Data Exploration
• JSON Analytics
• DWH offload

Data sources shown: Hive, HBase, Files, Directories, …; {JSON}, Parquet, Text Files, …

Page 27: Evolving from RDBMS to NoSQL + SQL


Drill Enables ‘SQL-on-Everything’

SELECT * FROM dfs.yelp.`business.json`

• dfs: storage plugin instance
– DFS (Text, Parquet, JSON)
– HBase/MapR-DB
– Hive Metastore/HCatalog
– Easy API to go beyond Hadoop
• yelp: workspace
– Sub-directory
– HBase namespace
– Hive database
• `business.json`: table
– Pathnames
– Hive table
– HBase table
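The three-part name in the query above can be pulled apart mechanically. The helper below is a hypothetical illustration in Python, not part of Drill: it splits on dots while keeping a backtick-quoted part (which may itself contain dots, as a file name does) in one piece.

```python
import re


def split_drill_name(name):
    """Split a dotted name, keeping backtick-quoted parts intact."""
    parts = re.findall(r"`([^`]*)`|([^.`]+)", name)
    return [quoted or bare for quoted, bare in parts]


plugin, workspace, table = split_drill_name("dfs.yelp.`business.json`")
print(plugin, workspace, table)
```

The backticks are what let a path such as `business.json` serve as a table name even though it contains a dot.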

Page 28: Evolving from RDBMS to NoSQL + SQL


Reuse Existing SQL Tools and Skills

Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support

Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Page 29: Evolving from RDBMS to NoSQL + SQL


Security Controls

Page 30: Evolving from RDBMS to NoSQL + SQL


Access Controls that Scale

PAM Authentication + User Impersonation

Fine-grained row and column level access control with Drill Views – no centralized security repository required

[Figure: users (U) reach Files, HBase and Hive data through Drill View 1 and Drill View 2.]

Page 31: Evolving from RDBMS to NoSQL + SQL


Granular Security via Drill Views

Raw File (/raw/cards.csv)
Owner: Admins. Permission: Admins.
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Owner: Admins. Permission: Data Scientists. Not a physical data copy.
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins. Permission: Business Analysts.
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
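The idea that a view is a rule applied at read time, not a second copy of the data, can be modelled in a few lines of Python. This is an illustrative sketch only: the column keys and function names are hypothetical, and the masking rule (keep the first group of digits, replace the rest with 1s) is inferred from the sample rows on this slide.

```python
import csv
import io

# The single physical copy of the data (rows taken from the slide).
RAW = """name,city,state,card
Dave,San Jose,CA,1374-7914-3865-4817
John,Boulder,CO,1374-9735-1794-9711
"""


def raw_rows():
    """Read the raw file; only admins would be allowed to call this directly."""
    return list(csv.DictReader(io.StringIO(RAW)))


def data_scientist_view():
    """Keep every column, but mask all but the first card group."""
    for row in raw_rows():
        yield {**row, "card": row["card"][:4] + "-1111-1111-1111"}


def business_analyst_view():
    """Drop the card column entirely."""
    for row in raw_rows():
        yield {k: v for k, v in row.items() if k != "card"}


print(list(data_scientist_view()))
print(list(business_analyst_view()))
```

Both views are computed from the same raw rows each time they are read, so a change to the raw file is visible through them immediately.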

Page 32: Evolving from RDBMS to NoSQL + SQL


Ownership Chaining
Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv): John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO

Ownership chaining: RAW FILE <- V_Scientist <- V_Analyst

Access path: Jack queries the view V_Analyst
– Does Jack have access to V_Analyst? -> YES
– Who is the owner of V_Analyst? -> Jane
– Drill accesses V_Analyst as Jane (Impersonation hop 1)
– Does Jane have access to V_Scientist? -> YES
– Who is the owner of V_Scientist? -> John
– Drill accesses V_Scientist as John (Impersonation hop 2)
– Does John have permissions on raw file? -> YES
– Who is the owner of raw file? -> John
– Drill accesses source file as John (no impersonation here)

*Ownership chain length (# hops) is configurable
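The access path on this slide can be traced with a small Python model. It is a sketch under stated assumptions, not Drill's implementation: the `OBJECTS` table, the `resolve` function and the `max_hops` default are all hypothetical, and owners are assumed to have read access to what they own.

```python
# Each object records its owner, who may read it, and what it is built on.
OBJECTS = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jane", "Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"John", "Jane"}, "source": "raw"},
    "raw": {"owner": "John", "readers": {"John"}, "source": None},
}


def resolve(user, name, max_hops=3):
    """Return the (user checked, object) pairs visited while resolving a query."""
    trail, hops = [], 0
    while name is not None:
        obj = OBJECTS[name]
        if user not in obj["readers"]:
            raise PermissionError(f"{user} cannot read {name}")
        trail.append((user, name))
        if obj["source"] is not None and obj["owner"] != user:
            hops += 1  # switching identity to the owner counts as one hop
            if hops > max_hops:
                raise PermissionError("ownership chain too long")
            user = obj["owner"]
        name = obj["source"]
    return trail


print(resolve("Jack", "V_Analyst"))
```

Jack never needs permission on the raw file: each link in the chain is checked against the owner of the view above it, and the configurable hop limit bounds how long such a chain may grow.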

Page 33: Evolving from RDBMS to NoSQL + SQL


Security Summary

• Logical
– No physical data copies/silos
• Granular
– Row level and column level security controls
• De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
• Self-service w/ governance
– If you have access to data, you control who can access it and how widely
– Audits

Page 34: Evolving from RDBMS to NoSQL + SQL


Using Drill with Yelp

Page 35: Evolving from RDBMS to NoSQL + SQL


Business dataset

{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true, "intimate": false, "touristy": false, "hipster": false,
      "classy": true, "trendy": false, "casual": false
    },
    "Good For": {
      "dessert": false, "latenight": false, "lunch": false,
      "dinner": true, "breakfast": false, "brunch": false
    }
  }
}

Page 36: Evolving from RDBMS to NoSQL + SQL


Zero to Results in 2 minutes

Install
$ tar -xvzf apache-drill-1.9.0.tar.gz

Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded

Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC
  LIMIT 10;

Results
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+

Page 37: Evolving from RDBMS to NoSQL + SQL


Directories are implicit partitions

SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0

sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
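How a directory layout turns into columns can be modelled in plain Python. The sketch below is not Drill code: it builds a small `sales/<year>/<quarter>/` tree in a temporary directory with made-up amounts, derives `dir0` and `dir1` from each file's path, and totals the amount per year for quarters q1 and q2. The file name `part.txt` and the one-number-per-line file format are assumptions for illustration.

```python
import os
import tempfile
from collections import defaultdict


def scan(root):
    """Yield (dir0, dir1, amount) for every line of every file under root/<dir0>/<dir1>/."""
    for dirpath, _dirs, files in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) != 2:
            continue  # only files two levels down carry both columns
        for fname in files:
            with open(os.path.join(dirpath, fname)) as fh:
                for line in fh:
                    yield parts[0], parts[1], float(line)


def total_by_year(root, quarters):
    """Sum amounts per dir0, keeping only rows whose dir1 is in `quarters`."""
    totals = defaultdict(float)
    for dir0, dir1, amount in scan(root):
        if dir1 in quarters:
            totals[dir0] += amount
    return dict(totals)


# Build a tiny sales tree with made-up amounts.
root = tempfile.mkdtemp()
for year, quarter, amount in [("2014", "q1", 10), ("2014", "q3", 5), ("2015", "q1", 7)]:
    os.makedirs(os.path.join(root, year, quarter))
    with open(os.path.join(root, year, quarter, "part.txt"), "w") as fh:
        fh.write(f"{amount}\n")

print(total_by_year(root, {"q1", "q2"}))
```

Because the partition values come from the path, a filter on `dir1` can in principle skip whole directories without opening any file in them.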

Page 38: Evolving from RDBMS to NoSQL + SQL


Intuitive SQL Access to Complex Data

// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00'
    AND b.hours.Friday.`close` > '22:00'
    AND REPEATED_CONTAINS(categories, 'Mediterranean')
    AND city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;

+------------+------------+------------+------------+
| name | stars | friday | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+

Query data with any levels of nesting
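The same filter can be written against nested Python dictionaries, which makes the path navigation explicit. This is a model of the query's logic, not Drill code. The two records reuse values shown on the slides (the Olives result row and the Mon Ami Gabi business record); the `city` value for Olives is implied by the query's filter, and the helper `open_at` is hypothetical.

```python
businesses = [
    {
        "name": "Olives",
        "stars": 4.0,
        "city": "Las Vegas",
        "hours": {"Friday": {"open": "11:00", "close": "22:30"}},
        "categories": ["Mediterranean", "Restaurants"],
    },
    {
        "name": "Mon Ami Gabi",
        "stars": 4.0,
        "city": "Las Vegas",
        "hours": {"Friday": {"open": "07:00", "close": "00:00"}},
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
    },
]


def open_at(business, day, time):
    """Compare zero-padded HH:MM strings, as the query's string predicates do."""
    hours = business.get("hours", {}).get(day)
    return hours is not None and hours["open"] < time < hours["close"]


hits = [
    b["name"]
    for b in businesses
    if b["city"] == "Las Vegas"
    and "Mediterranean" in b["categories"]  # plays the role of REPEATED_CONTAINS
    and open_at(b, "Friday", "22:00")
]
print(hits)
```

One thing the model makes visible: because the times are compared as strings, a business that closes at "00:00" is treated as closed at 22:00, so places open past midnight are missed by this style of predicate.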

Page 39: Evolving from RDBMS to NoSQL + SQL


Reviews dataset

{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}

Page 40: Evolving from RDBMS to NoSQL + SQL


ANSI SQL Compatibility

// Get top cool rated businesses
SELECT b.name
FROM dfs.yelp.`business.json` b
WHERE b.business_id IN (
  SELECT r.business_id
  FROM dfs.yelp.`review.json` r
  GROUP BY r.business_id
  HAVING SUM(r.votes.cool) > 2000
  ORDER BY SUM(r.votes.cool) DESC);

+------------+
| name |
+------------+
| Earl of Sandwich |
| XS Nightclub |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon |
+------------+

Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)

Page 41: Evolving from RDBMS to NoSQL + SQL


Logical Views

// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------------+------------+

> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+

Lightweight file system based views for granular and de-centralized data management

Page 42: Evolving from RDBMS to NoSQL + SQL


Materialized Views AKA Tables

> ALTER SESSION SET `store.format` = 'parquet';

> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------------+---------------------------+
| Fragment | Number of records written |
+------------+---------------------------+
| 1_0 | 176448 |
| 1_1 | 192439 |
| 1_2 | 198625 |
| 1_3 | 200863 |
| 1_4 | 181420 |
| 1_5 | 175663 |
+------------+---------------------------+

Save analysis results as tables using familiar CTAS syntax

Page 43: Evolving from RDBMS to NoSQL + SQL


Repeated Values Support

// Flatten repeated categories
> SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------------+------------+

> SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+------------+------------+

Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
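What FLATTEN does to each row can be modelled with a short Python generator. This is a sketch of the semantics shown in the output above, not Drill's implementation; the function name and the row representation (a list of dicts) are choices made for illustration.

```python
def flatten(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    for row in rows:
        for item in row[column]:
            # Copy the other fields and replace the list with a single element.
            yield {**row, column: item}


rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
]
flat = list(flatten(rows, "categories"))
for row in flat:
    print(row)
```

Each input row fans out into as many rows as its array has elements, with the other columns repeated, which is what makes the result usable with ordinary GROUP BY and JOIN.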

Page 44: Evolving from RDBMS to NoSQL + SQL


Checkins dataset

{
  "checkin_info": {
    "3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "14-6": 1, "14-2": 1, "14-3": 1,
    "19-0": 1, "11-5": 1, "13-2": 1, "11-6": 2, "11-3": 1, "12-6": 1, "6-5": 1,
    "5-5": 1, "9-2": 1, "9-5": 1, "9-6": 1, "5-2": 1, "7-6": 1, "7-5": 1,
    "7-4": 1, "17-5": 1, "8-5": 1, "10-2": 1, "10-5": 1, "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}

Page 45: Evolving from RDBMS to NoSQL + SQL


Supports Dynamic / Unknown Columns

> SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+

> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` LIMIT 6;
+------------+
| checkins |
+------------+
| {"key":"3-4","value":1} |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1} |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
+------------+

Convert Map with a wide set of dynamic columns into an array of key-value pairs
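The KVGEN step amounts to turning a map whose keys are data into a list of uniform records. A plain-Python model follows; it is a sketch of the transformation shown above, not Drill code, and it uses a three-entry subset of the checkin map to stay short.

```python
def kvgen(mapping):
    """Turn {key: value, ...} into [{"key": key, "value": value}, ...]."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


# A small subset of the checkin_info map from the Checkins dataset slide.
checkin_info = {"3-4": 1, "13-5": 1, "11-6": 2}

pairs = kvgen(checkin_info)
total = sum(p["value"] for p in pairs)
busiest = max(pairs, key=lambda p: p["value"])["key"]
print(pairs, total, busiest)
```

Once the keys live in a `key` field, they can be filtered, grouped and aggregated like any other column, with no need to know the key names in advance.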

Page 46: Evolving from RDBMS to NoSQL + SQL


Resources

Page 47: Evolving from RDBMS to NoSQL + SQL


Drill is Top-Ranked SQL-on-Hadoop

Source: Gigaom Research, 2015

Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector

“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”

Page 48: Evolving from RDBMS to NoSQL + SQL


Page 49: Evolving from RDBMS to NoSQL + SQL


OJAI and MapR-DB

Where to find it…
– The source: https://github.com/ojai/ojai
– The site: http://ojai.github.io/
– Python bindings: https://github.com/mapr-demos/python-bindings
– JavaScript bindings: https://github.com/mapr-demos/js-bindings

Ready to play with your data?
– Download the sandbox: http://maprdb.io
– Examples:
• Java: https://github.com/mapr-demos/maprdb-ojai-101
• Python: https://github.com/mapr-demos/maprdb_python_examples

Page 50: Evolving from RDBMS to NoSQL + SQL


Drill Walkthrough

• Example queries
• Conversion from relational model to flat JSON model

https://www.mapr.com/blog/drilling-healthy-choices

https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql

Page 52: Evolving from RDBMS to NoSQL + SQL


@kingmesal

[email protected]

Engage with us!

kingmesal