Simplifying Analytic Workloads via Complex Schemas
Josh Wills, Alex Behm, and Marcel Kornacker
Data Modeling for Data Science
© Cloudera, Inc. All rights reserved.

Data Modeling for Data Science: Simplify Your Workload with Complex Types in Impala

On Complex Schemas

• Relational databases are optimized to work with flat schemas
• Much of the useful information in the world is stored using complex schemas (a.k.a. the “document model”)
  • Log files
  • NoSQL data stores
  • REST APIs

Handling Complex Schemas in SQL

Complex Schemas and Data Science

Think Like A Data Scientist

Building Data Science Teams: A Moneyball Approach

Leveraging The Power of SQL Engines…

…While Managing the Cost of SQL Engines

Intentional Complex Schemas

Better Resource Utilization

Higher Data Quality

Simplifying Analytic Queries

Data Science For Everyone

Nested Types in Impala

Support for Nested Data Types: STRUCT, MAP, ARRAY

Natural Extensions to SQL

Full Expressiveness of SQL with Nested Types

Example TPCH-like Schema

CREATE TABLE Customers (
  id BIGINT,
  address STRUCT<
    city: STRING,
    zip: INT
  >,
  orders ARRAY<STRUCT<
    date: TIMESTAMP,
    price: DECIMAL(12,2),
    cc: BIGINT,
    items: ARRAY<STRUCT<
      item_no: STRING,
      price: DECIMAL(9,2)
    >>
  >>
)

Impala Syntax Extensions

• Path expressions extend column references to scalars (nested structs)
  • Can appear anywhere a conventional column reference is used
• Collections (maps and arrays) are exposed like sub-tables
  • Use the FROM clause to specify which collections to read, like conventional tables
  • Can use JOIN conditions to express the join relationship (default is INNER JOIN)

Find the ids and city of customers who live in the zip code 94305:

SELECT id, address.city FROM customers WHERE address.zip = 94305

Find all orders that were paid for with a customer’s preferred credit card:

SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc

Referencing Arrays & Maps

• Basic idea: Flatten nested collections referenced in the FROM clause
• Can be thought of as an implicit join on the parent/child relationship

SELECT c.id, o.date FROM customers c, c.c_orders o

c.id    o.date
100     2012-05-03
100     2013-07-08
100     2011-01-29
…       …
101     2014-02-04
101     2015-11-15
…       …

Note: the id of a customer is repeated for every order.
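The flattening behavior described above can be modeled outside of Impala. The following is a minimal Python sketch, not Impala's implementation: the in-memory `customers` list and the `flatten_orders` helper are hypothetical stand-ins for the table and for the `FROM customers c, c.c_orders o` clause.

```python
# Hypothetical in-memory stand-in for the customers table (not real data).
customers = [
    {"id": 100, "orders": [{"date": "2012-05-03"},
                           {"date": "2013-07-08"},
                           {"date": "2011-01-29"}]},
    {"id": 101, "orders": [{"date": "2014-02-04"},
                           {"date": "2015-11-15"}]},
    {"id": 102, "orders": []},  # a customer with no orders
]


def flatten_orders(rows):
    """Yield one (customer id, order date) pair per nested order.

    This mirrors the implicit parent/child join: the parent id is
    repeated for every child, and parents without children vanish.
    """
    for c in rows:
        for o in c["orders"]:
            yield (c["id"], o["date"])


flat = list(flatten_orders(customers))
```

In this model customer 102 produces no output rows, which matches the inner-join default of the comma syntax.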

Referencing Arrays & Maps

SELECT c.c_custkey, o.o_orderkey FROM customers c, c.c_orders o

Returns customer/order data for customers that have at least one order

SELECT c.c_custkey, o.o_orderkey FROM customers c LEFT OUTER JOIN c.c_orders o

Also returns customers with no orders (with order fields NULL)

SELECT c.c_custkey FROM customers c LEFT ANTI JOIN c.c_orders o

Find all customers that have no orders

SELECT c.c_custkey, o.o_orderkey FROM customers c INNER JOIN c.c_orders o

Same result as the first query, with the default INNER JOIN written explicitly
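To make the difference between these join variants concrete, here is a small Python model of their row-level semantics. It is only a sketch under assumptions: the `customers` list and the three helper functions are hypothetical and say nothing about how Impala executes these joins.

```python
# Hypothetical sample data: customer 1 has two orders, customer 2 has none.
customers = [
    {"c_custkey": 1, "c_orders": [{"o_orderkey": 10}, {"o_orderkey": 11}]},
    {"c_custkey": 2, "c_orders": []},
]


def inner_join(rows):
    """Comma syntax / INNER JOIN: one row per (customer, order) pair."""
    return [(c["c_custkey"], o["o_orderkey"])
            for c in rows for o in c["c_orders"]]


def left_outer_join(rows):
    """LEFT OUTER JOIN: customers without orders survive with a NULL order."""
    out = []
    for c in rows:
        if c["c_orders"]:
            out.extend((c["c_custkey"], o["o_orderkey"]) for o in c["c_orders"])
        else:
            out.append((c["c_custkey"], None))
    return out


def left_anti_join(rows):
    """LEFT ANTI JOIN: only customers whose order collection is empty."""
    return [c["c_custkey"] for c in rows if not c["c_orders"]]


inner = inner_join(customers)
outer = left_outer_join(customers)
anti = left_anti_join(customers)
```

Note that the anti join can only return parent columns, since by definition there is no matching child row to read from.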

Motivation for Advanced Querying Capabilities

• Count the number of orders per customer

SELECT COUNT(*) FROM customers c, c.orders o GROUP BY c.id

• Count the number of items per order

SELECT COUNT(*) FROM customers.orders o, o.items GROUP BY ??? (the grouping key must be unique)

• Impractical: requires a unique id at every nesting level
  • Information is already expressed in the nesting relationship!
• What about even more interesting queries?
  • Get the number of orders and the average item price per customer?
  • “Group by” multiple nesting levels at the same time

Advanced Querying: Aggregates over Nested Collections

• Count the number of orders per customer

SELECT id, COUNT(orders) FROM customers

• Count the number of items per order

SELECT date, COUNT(items) FROM customers.orders

• Get the number of orders and the average item price per customer

SELECT id, COUNT(orders), AVG(orders.items.price) FROM customers
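The per-row aggregates above can be read as "aggregate over the collection that belongs to this row", with no GROUP BY key required. A minimal Python sketch of that reading follows; the sample data and the `order_stats` helper are hypothetical, and an empty collection is assumed to average to NULL (`None`), as SQL's AVG does over zero rows.

```python
from statistics import mean

# Hypothetical nested rows shaped like the Customers schema.
customers = [
    {"id": 1, "orders": [
        {"items": [{"price": 10.0}, {"price": 20.0}]},
        {"items": [{"price": 30.0}]},
    ]},
    {"id": 2, "orders": []},
]


def order_stats(rows):
    """Return (id, number of orders, average item price) per customer.

    Each aggregate is computed over the row's own nested collection,
    so the nesting relationship plays the role of the grouping key.
    """
    out = []
    for c in rows:
        prices = [i["price"] for o in c["orders"] for i in o["items"]]
        avg = mean(prices) if prices else None  # AVG over no rows is NULL
        out.append((c["id"], len(c["orders"]), avg))
    return out


stats = order_stats(customers)
```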

Advanced Querying: Relative Table References

• Full expressibility of SQL with nested collections
  • Arbitrary SQL allowed in inline views and subqueries with relative table refs
• Exploits nesting relationship
  • No need for stored unique ids at all interesting levels

SELECT id FROM customers c WHERE (SELECT AVG(price) FROM c.orders.items) > 1000

SELECT id FROM customers c JOIN (SELECT price FROM c.orders ORDER BY date DESC LIMIT 2) v
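A relative table reference such as `c.orders` is evaluated once per parent row, much like a nested loop over that row's own collection. The Python sketch below models the two queries above under that reading. The data and helper names are hypothetical, and the second helper also returns the order price so the per-customer subquery result is visible, whereas the slide's query selects only `id`.

```python
# Hypothetical nested rows; dates are ISO strings so they sort correctly.
customers = [
    {"id": 1, "orders": [
        {"date": "2014-03-01", "price": 3000.0,
         "items": [{"price": 1200.0}, {"price": 1800.0}]},
        {"date": "2015-06-01", "price": 900.0, "items": [{"price": 900.0}]},
        {"date": "2013-01-01", "price": 1500.0, "items": [{"price": 1500.0}]},
    ]},
    {"id": 2, "orders": [
        {"date": "2015-02-02", "price": 20.0, "items": [{"price": 20.0}]},
    ]},
    {"id": 3, "orders": []},
]


def big_spenders(rows, threshold=1000):
    """Ids of customers whose average item price exceeds the threshold."""
    out = []
    for c in rows:
        prices = [i["price"] for o in c["orders"] for i in o["items"]]
        # An empty collection gives a NULL average, which fails the filter.
        if prices and sum(prices) / len(prices) > threshold:
            out.append(c["id"])
    return out


def latest_two_orders(rows):
    """(id, order price) for each customer's two most recent orders."""
    out = []
    for c in rows:
        recent = sorted(c["orders"], key=lambda o: o["date"], reverse=True)[:2]
        out.extend((c["id"], o["price"]) for o in recent)
    return out


spenders = big_spenders(customers)
latest = latest_two_orders(customers)
```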

Impala Nested Types in Action

Josh Wills’ blog post on analyzing misspelled queries:
http://blog.cloudera.com/blog/2014/08/how-to-count-events-like-a-data-scientist/

• Goal: Rudimentary spell checker based on counting query/click events
• Problem: Cross-referencing items in multiple nested collections
  • Representative of many machine learning tasks (Josh tells me)
• Goal was not naturally achievable with Hive
  • Josh implemented a custom Hive extension “WITHIN”
• How can Impala serve this use case?

Impala Nested Types in Action

Schema

account_id: bigint
search_events: array<struct<
  event_id: bigint
  query: string
  tstamp_sec: bigint
  ...
>>
install_events: array<struct<
  event_id: bigint
  search_event_id: bigint
  app_id: bigint
  ...
>>

SELECT bad.query, good.query, count(*) AS cnt
FROM sessions s, s.search_events bad, s.search_events good
WHERE bad.tstamp_sec < good.tstamp_sec
  AND good.tstamp_sec - bad.tstamp_sec < 30
  AND bad.event_id NOT IN (SELECT search_event_id FROM s.install_events)
  AND good.event_id IN (SELECT search_event_id FROM s.install_events)
GROUP BY bad.query, good.query
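The logic of the spell-checker query can be traced in plain Python. This is a sketch under assumptions, not the blog post's code: the `sessions` sample and the `misspelling_counts` helper are hypothetical, and `search_event_id` is assumed to be non-NULL so that the NOT IN test behaves like simple set membership.

```python
from collections import Counter

# Hypothetical session: a misspelled search followed 10 seconds later by a
# corrected search that led to an install, plus an unrelated later search.
sessions = [
    {
        "account_id": 1,
        "search_events": [
            {"event_id": 1, "query": "angry brids", "tstamp_sec": 100},
            {"event_id": 2, "query": "angry birds", "tstamp_sec": 110},
            {"event_id": 3, "query": "maps", "tstamp_sec": 500},
        ],
        "install_events": [
            {"event_id": 7, "search_event_id": 2, "app_id": 42},
            {"event_id": 8, "search_event_id": 3, "app_id": 43},
        ],
    },
]


def misspelling_counts(rows, window_sec=30):
    """Count (bad query, good query) pairs within each session.

    A pair qualifies when the bad search precedes the good one by less
    than window_sec, the bad search led to no install, and the good one did.
    """
    counts = Counter()
    for s in rows:
        installed = {e["search_event_id"] for e in s["install_events"]}
        for bad in s["search_events"]:
            for good in s["search_events"]:
                if (bad["tstamp_sec"] < good["tstamp_sec"]
                        and good["tstamp_sec"] - bad["tstamp_sec"] < window_sec
                        and bad["event_id"] not in installed
                        and good["event_id"] in installed):
                    counts[(bad["query"], good["query"])] += 1
    return counts


result = misspelling_counts(sessions)
```

Both the self-join of `search_events` and the lookups into `install_events` stay within a single session row, which is what the relative table references in the query express.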

Design Considerations for Complex Schemas

• Turn parent-child hierarchies into nested collections
  • logical hierarchy becomes physical hierarchy
  • nested structure = join index: physical clustering of child with parent
• For distributed, big data systems, this matters: a distributed join turns into a local join

Flat TPCH vs. Nested TPCH

[Figure: two entity-relationship diagrams. Flat TPCH: Part, PartSupp, Supplier, Lineitem, Customer, Orders, Nation, Region. Nested TPCH: Part, PartSupp, Suppliers, Lineitems, Customer, Orders, Nation, Region. Edges are labeled with 1:N cardinalities.]

Columnar Storage for Complex Schemas

• Columnar storage: a necessity for processing nested data at speed
  • complex schemas = really wide tables
  • row-wise storage: read lots of data you don’t care about
• Columnar formats effectively store a join index

{id: 17 orders: [{date: 2014-01-29, price: 10.2}, {date: ….} …]}

{id: 18 orders: [{date: 2015-04-10, price: 2000.0}, {date: ….} …]}

Columns:
id: 17, 18
orders.price: 10.2, 22.1, 100, 200

Columnar Storage for Complex Schemas

• A “join” between parent and child is essentially free:
  • coordinated scan of parent and child columns
  • data effectively pre-sorted on parent PK
  • merge join beats hash join!
• Aggregating child data is cheap:
  • the data is already pre-grouped by the parent
  • amenable to vectorized execution
  • local non-grouping aggregation <<< distributed grouping aggregation
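The "coordinated scan" idea can be illustrated with a toy column layout. The sketch below is a deliberate simplification and an assumption on my part: it stores a per-parent child count next to the value columns, which is not how Parquet actually encodes nesting (Parquet uses repetition and definition levels). The `shred` and `sum_price_per_parent` helpers and the sample records are hypothetical.

```python
# Hypothetical nested records.
records = [
    {"id": 17, "orders": [{"price": 10.0}, {"price": 22.0}]},
    {"id": 18, "orders": [{"price": 100.0}, {"price": 200.0}]},
]


def shred(rows):
    """Split nested rows into three flat columns.

    order_counts acts as the join index: it records how many entries of
    the prices column belong to each parent, in parent order.
    """
    ids, order_counts, prices = [], [], []
    for rec in rows:
        ids.append(rec["id"])
        order_counts.append(len(rec["orders"]))
        prices.extend(o["price"] for o in rec["orders"])
    return ids, order_counts, prices


def sum_price_per_parent(ids, order_counts, prices):
    """Aggregate child values with one forward pass over both columns.

    No hash table or shuffle is needed because the child column is
    already grouped and ordered by parent.
    """
    out, pos = [], 0
    for cid, n in zip(ids, order_counts):
        out.append((cid, sum(prices[pos:pos + n])))
        pos += n
    return out


ids, order_counts, prices = shred(records)
totals = sum_price_per_parent(ids, order_counts, prices)
```

The single forward pass is the merge-join-like behavior the bullets refer to: parent and child columns advance together and each group is aggregated locally.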

Analytics on Nested Data: Cheap & Easy!

Query: Find the 10 customers that buy the most expensive items on average

Flat:

SELECT c.id, AVG(i.price) avg_price
FROM customer c, order o, item i
WHERE c.id = o.cid AND o.id = i.oid
GROUP BY c.id
ORDER BY avg_price DESC LIMIT 10

Nested:

SELECT c.id, AVG(c.orders.items.price) avg_price
FROM customer c
ORDER BY avg_price DESC LIMIT 10
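As a rough illustration of why the two formulations answer the same question with different amounts of work, the following Python sketch computes the top-n over a flat and a nested copy of the same hypothetical data. The table contents and both helpers are assumptions made for illustration. One labeled simplification: customers without any items are skipped in the nested version so that it matches the inner-join behavior of the flat query.

```python
import heapq

# Hypothetical flat tables.
orders = [{"id": 10, "cid": 1}, {"id": 11, "cid": 2}]
items = [{"oid": 10, "price": 5.0}, {"oid": 10, "price": 15.0},
         {"oid": 11, "price": 50.0}]

# The same data, nested.
nested = [
    {"id": 1, "orders": [{"items": [{"price": 5.0}, {"price": 15.0}]}]},
    {"id": 2, "orders": [{"items": [{"price": 50.0}]}]},
]


def top_flat(orders, items, n=10):
    """Join items to their customers, group by customer id, then take top n."""
    order_to_cid = {o["id"]: o["cid"] for o in orders}  # hash join build side
    sums = {}
    for i in items:
        cid = order_to_cid[i["oid"]]
        total, count = sums.get(cid, (0.0, 0))
        sums[cid] = (total + i["price"], count + 1)     # grouping aggregation
    avgs = [(cid, total / count) for cid, (total, count) in sums.items()]
    return heapq.nlargest(n, avgs, key=lambda r: r[1])


def top_nested(rows, n=10):
    """Aggregate each row's own items locally, then take top n."""
    avgs = []
    for c in rows:
        prices = [i["price"] for o in c["orders"] for i in o["items"]]
        if prices:  # simplification: skip customers with no items
            avgs.append((c["id"], sum(prices) / len(prices)))
    return heapq.nlargest(n, avgs, key=lambda r: r[1])


flat_result = top_flat(orders, items)
nested_result = top_nested(nested)
```

The flat version needs a lookup structure and a grouping table keyed by customer, while the nested version only touches one row at a time.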

Analytics on Nested Data: Cheap & Easy!

[Figure: side-by-side query plans for the flat and nested queries, built from Scan, Xch, Join, Agg, Unnest, and Top-n operators. Part of the figure is annotated “$$$”.]

A Note to BI Tool Vendors

Please support nested types; it’s not that hard!
• you already take advantage of declared PK/FK relationships
• a nested collection isn’t really that different
• except you don’t need a join predicate

Thank you