
Data Modeling for Integration of NoSQL with a Data Warehouse


DecisionLab.Net

business intelligence is business performance

DecisionLab http://www.decisionlab.net [email protected] direct 760.525.3268 Carlsbad, California, USA


by daniel upton

How


Data Modeling for the Integration of NoSQL with a Data Warehouse

SQL Saturday #449, San Diego, Sept 19, 2015

by daniel upton

data warehouse developer / modeler / architect

certified scrum master

DecisionLab.Net business intelligence is business performance

[email protected] blog: www.decisionlab.net connect: www.linkedin.com/in/DanielUpton


Opening Questions…

o Why model data?

o What role does visualization play in a data model?

o Why integrate RDBMS data warehouse with NoSQL data?

o What do I mean by “integrate data”?

o Why model data for an integration between an RDBMS Data Warehouse and NoSQL?

o Should data ever be moved between a Data Warehouse and NoSQL? If so, which way?

o Regardless of a decision to move data or not, at what stage in an RDBMS DW environment should we integrate with NoSQL?

o Staging, EDW, Star Schema, Extracts?

o How do Lean and Agile thinking influence our choice between these stages or methods?

o What does a useful model for data integration between NoSQL and RDBMS communicate?

o How well do various data modeling methods support integration with NoSQL?

o Does a data scientist need a star schema, or any single-version-of-the-truth (SVOT) structure, to obtain needed answers?

o What are some practical guidelines for when we actually need to accomplish this?


Why do data modeling?


Q: Why do data modeling?

A: It is the process of learning about and defining a data structure, in either its current or desired state, in an information system.

o Once completed, the model is used to…

o Automatically instantiate a modeled data structure into an information system

o Communicate a complex data structure among people to validate it and increase shared understanding

What role does visualization play?


Visualization is vital for communication: physical proximity, vertical position, relationship lines, colors.

o Without visualization aids, a diagram goes from being a communication tool to a communication obstacle.


Why integrate a Data Warehouse with NoSQL?

o Business requirements: High-value analytics and business intelligence often require the integration of data across disparate data sources.

The industry is not talking much about it yet:

o No generally accepted methods

o Big technical skills gap between RDBMS and NoSQL

o Immaturity of methods and tools for both NoSQL modeling and for its integration with RDBMS

o Default assumption: Integration will always require Extraction, Transformation, and Loading (ETL), and is therefore a major project.


What exactly do I mean by “data integration”?

o Identify and “instantiate” joins at specific granularities between different data sets according to specific common topics in both sets (customer, click, like, product, purchase, inventory, shipment) to exploit now and later.

o Actual data movement and ETL are optional.

Why model for Integration of RDBMS and NoSQL?

o It is a very effective process for defining, visualizing, validating and communicating even more complex data structures in their current or desired state.

Quick Tips: For good ‘Model-Level Integration’ of RDBMS and NoSQL…

o Keep it simple and source-facing. Avoid complex data transformation.

o Model for simple equi-join relationships: ‘one to many’ or ‘one to one’.

Future Modeling Technology:

o Forward- and reverse-engineering to / from a combined, integrated RDBMS and NoSQL information system.
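The tip about modeling only for ‘one to many’ or ‘one to one’ equi-joins can be checked against actual key values before a join is drawn into the model. The following is a minimal sketch, not a method from the talk: the function name `join_cardinality` and the sample key lists are illustrative assumptions.

```python
from collections import Counter

def join_cardinality(left_keys, right_keys):
    """Classify an equi-join on a shared business key by observed cardinality."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    shared = set(left_counts) & set(right_counts)
    if not shared:
        return "no match"
    # A side is "many" if any shared key value repeats on that side.
    left_many = any(left_counts[k] > 1 for k in shared)
    right_many = any(right_counts[k] > 1 for k in shared)
    if left_many and right_many:
        return "many to many"
    if right_many:
        return "one to many"
    if left_many:
        return "many to one"
    return "one to one"

# Illustrative keys: one row per student in the warehouse,
# several "like" records per student in the NoSQL store.
dw_student_ids = ["1357", "2468"]
nosql_like_student_ids = ["1357", "1357", "2468"]
cardinality = join_cardinality(dw_student_ids, nosql_like_student_ids)
```

A result of "many to many" would suggest the candidate join is not at a simple enough granularity for model-level integration as described above.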


Should data ever be moved between a Data Warehouse and NoSQL? If so, which way?

Real world: Data will be moved in all directions… in ways good, bad and ugly!

Who thinks RDBMS and NoSQL integration should look like this? How likely is it?

NoSQL purists think it should look like the opposite, with the DW merely as a data source for NoSQL.


Specific methods from the major platform vendors are either high-level or proprietary. No common practice is yet accepted for integration between RDBMS and NoSQL. It’s the Wild West!

o High-level: data flow in nearly all directions

o Proprietary: Polybase with PDW / MS Analytics Platform


What about integration without data movement… without ETL?

Tip: Between the DW and NoSQL, avoid data movement just for the sake of integration.

o The goal of data integration is not, by itself, a sufficient justification to either move or substantially transform the data, because of the additional overhead that such movement and transformation require.

Regardless of a decision to move or not move the data, at what stage in a Data Warehouse environment should we integrate with NoSQL? Staging, EDW, Star Schema, Extracts?

o Staging: Increments of data, lacking enforceable referential integrity, so inherently non-integrated; staging thus offers low integration potential.


Enterprise Data Warehouse (Inmon): Entity-Relational Model

o ~3rd Normal Form, Date-Stamped Composite Primary Keys, No Surrogates

o Strategy to enforce a single version of the truth (SVOT), so each characteristic (attribute) of something (entity) exists in just one field and one table, with each instance as one record.

o Inherent, intentional rigid interdependence between classic 3NF tables, based on foreign key constraints

o Pristine data structure is often too rigidly normalized for model-level integration with NoSQL structures that play by different rules.

o Lean / Agile Score: Low. Rigid table structure with strong functional dependencies and specific cardinality baked into design. SVOT design requires data transformation from other sources to comply.


Dimensional / Star Schema: Either as standalone DW Bus (Kimball), or downstream from EDW as data presentation layer.

o Intent is to present an SVOT for pre-defined analyses baked into the star schema

o Rigid functional dependence between tables

o Descriptive data is now in a de-normalized dimension table with foreign key relationships only to fact tables containing quantitative fields.

o Lean / Agile Score: Even lower. Even more rigid structure, with added surrogate keys wherein dimensions relate only to existing RDBMS fact tables for pre-defined analyses. Unique IDs such as Department_Code_14 become non-unique (denormalized), thus weaker for new integrations.

o NoSQL integration requires new Star Schema tables.


Data Vault Method:

o Summary of Hubs, Satellites, Links, Ensembles (Linstedt, Hultgren, Graziano).

o Align data records, via their business keys, across tables and across systems.

o Track changes to source data records while maintaining or enhancing actual referential integrity between related tables.

o Defer the following: (a) the renaming of source attributes per DW naming standards; (b) the selection of desired fields and records to present for reporting; and (c) any application of subjective business rules or an SVOT attempt, until immediately downstream of the model, in a Star Schema or Semantic Layer.

…and…

o Use a pattern-based separation of keys, attributes, and relationships to accomplish the above while remaining transparently equivalent and auditable to source data.

o Lean / Agile Score: High. Each ensemble stands alone. Hubs, the sole integration point to other ensembles, have zero functional dependencies. Relationship cardinality between ensembles becomes an association, accepting any cardinality based on actual data, not pre-defined business rules. New data subject areas (ensembles) are easily added and introduce zero new functional dependencies on existing structure.
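To make the hub / satellite / link pattern concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is a simplification under stated assumptions: all table and column names (`hub_student`, `sat_student`, `link_student_major`, `student_bk`, and so on) are hypothetical, the sample rows are made up for illustration, and the business keys are used directly as primary keys, whereas real Data Vault implementations typically add hash or sequence keys and a record-source column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hubs: business keys only, no descriptive attributes.
CREATE TABLE hub_student (student_bk TEXT PRIMARY KEY, load_date TEXT);
CREATE TABLE hub_major   (major_bk   TEXT PRIMARY KEY, load_date TEXT);
-- Satellite: descriptive attributes, one row per observed change.
CREATE TABLE sat_student (
    student_bk   TEXT REFERENCES hub_student(student_bk),
    load_date    TEXT,
    student_name TEXT,
    PRIMARY KEY (student_bk, load_date)
);
-- Link: an association between hubs that accepts any cardinality.
CREATE TABLE link_student_major (
    student_bk TEXT REFERENCES hub_student(student_bk),
    major_bk   TEXT REFERENCES hub_major(major_bk),
    load_date  TEXT,
    PRIMARY KEY (student_bk, major_bk)
);
""")

# Made-up sample rows for illustration only.
conn.execute("INSERT INTO hub_student VALUES ('1357', '2015-08-01')")
conn.executemany("INSERT INTO hub_major VALUES (?, ?)",
                 [("985", "2015-08-01"), ("412", "2015-08-01")])
conn.executemany("INSERT INTO sat_student VALUES (?, ?, ?)",
                 [("1357", "2015-08-01", "Hannah Shelby"),
                  ("1357", "2015-08-04", "Hannah S. Shelby")])
conn.executemany("INSERT INTO link_student_major VALUES (?, ?, ?)",
                 [("1357", "985", "2015-08-04"), ("1357", "412", "2015-08-04")])

majors_for_student = conn.execute(
    "SELECT COUNT(*) FROM link_student_major WHERE student_bk = ?", ("1357",)
).fetchone()[0]
history_rows = conn.execute(
    "SELECT COUNT(*) FROM sat_student WHERE student_bk = ?", ("1357",)
).fetchone()[0]
```

In this sketch the link table lets one student relate to several majors without any change to the hubs, and the satellite keeps both versions of the changed attribute.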


For more insights into the Lean Data Warehouse and Data Vault concepts, see…

www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault


NoSQL Data Models: Cassandra Example…

What to Expect in a NoSQL Model…

Maybe no model at all

If modeled…

o De-normalization

o Pivoted data

o Some records with a different number of fields than adjacent records

o Super-columns and sub-columns

o Some (not all) of these items may prevent meaningful integration modeling between NoSQL and RDBMS.
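Because adjacent records may carry different numbers of fields, it can help to measure which fields are present in every record before choosing join keys. The sketch below assumes the records have already been read into Python dictionaries; the function name `field_coverage` and the sample records are illustrative assumptions, not output from a real Cassandra cluster.

```python
from collections import Counter

def field_coverage(records):
    """Return, for each field name, the share of records that contain it."""
    counts = Counter(field for record in records for field in record)
    return {field: count / len(records) for field, count in counts.items()}

# Illustrative records with differing numbers of fields.
records = [
    {"Student_ID": "1357", "MajorID": "985", "Date_Liked": "2015_0804"},
    {"Student_ID": "2468", "MajorID": "985"},
    {"Student_ID": "2468", "MajorID": "985", "Date_Liked": "2015_0801",
     "Student_Like_Major_As_Role": "Minor"},
]

coverage = field_coverage(records)
# Only fields present in every record are safe candidates for equi-joins.
always_present = sorted(f for f, share in coverage.items() if share == 1.0)
```

Fields with partial coverage can still be modeled as optional attributes, but they are weak candidates for integration keys.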


o Even more different from RDBMS, the hierarchic data model of a document-based NoSQL store involves nested attributes (with or without unique identifiers).

o Example: JavaScript Object Notation (JSON) Document: Student Likes Major (same content)

{
  "MajorID": "985",    -- Top-level (parent) object with ID (Business Key)
  "MajorName": "Data Science",
  "Student_Likes_Major": {    -- Nested (child) object with IDs (Business Key for reliable equi-joins on Student_ID)
    "Student_1_Likes_Major": {
      "Student_ID": "1357",
      "Student_Name": "Hannah Shelby",
      "Student_Like_Major_As_Role": "2nd Major",
      "Date_Liked": "2015_0804",
      "Student_Like_NumDays_After_Survey_Posted_Social": "4"
    },
    "Student_2_Likes_Major": {
      "Student_ID": "2468",
      "Student_Name": "David Bookman",
      "Student_Like_Major_As_Role": "Minor",
      "Date_Liked": "2015_0801",
      "Student_Like_NumDays_After_Survey_Posted_Social": "1"
    },
    …
    "Student_N_Likes_Major": …
  },
  "Major_Academic_Counselor_Current": [    -- Nested Array (no IDs; no reliable equi-joins)
    { "Counselor": "Ms. Jenny Davis, M.Ed", "Counselor_Specialty_Name": "Career Prep" }
  ]
}
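As a minimal sketch of how the nested objects with IDs could be exposed for equi-joins, the Python below parses a trimmed, valid copy of the document (the elided `Student_N_Likes_Major` entry and the `-- ` annotations are omitted) and flattens each nested like into one row that carries the parent `MajorID`. The function name `flatten_likes` is a hypothetical helper, not part of any tool named in the talk.

```python
import json

doc_text = """
{
  "MajorID": "985",
  "MajorName": "Data Science",
  "Student_Likes_Major": {
    "Student_1_Likes_Major": {
      "Student_ID": "1357",
      "Student_Name": "Hannah Shelby",
      "Student_Like_Major_As_Role": "2nd Major",
      "Date_Liked": "2015_0804"
    },
    "Student_2_Likes_Major": {
      "Student_ID": "2468",
      "Student_Name": "David Bookman",
      "Student_Like_Major_As_Role": "Minor",
      "Date_Liked": "2015_0801"
    }
  },
  "Major_Academic_Counselor_Current": [
    {"Counselor": "Ms. Jenny Davis, M.Ed", "Counselor_Specialty_Name": "Career Prep"}
  ]
}
"""

def flatten_likes(doc):
    """Turn each nested like into one flat row keyed by MajorID and Student_ID."""
    rows = []
    for like in doc["Student_Likes_Major"].values():
        rows.append({"MajorID": doc["MajorID"], **like})
    return rows

doc = json.loads(doc_text)
rows = flatten_likes(doc)
```

Each resulting row holds both business keys, so it can be joined to student or major data on `Student_ID` or `MajorID`. The counselor array is left alone because, as annotated above, it has no identifier to join on.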


Business Scenario:

o In a Social Network Survey, a (one) university student Likes multiple combinations of 1st Majors, 2nd Majors, and Minors, but the University has not officially allowed them, nor do core OLTP systems support them.

o Registrar OLTP and legacy 3NF EDW use a business rule that only allows Many Students [enrolled in] One Major.

Objectives:

o Build a new RDBMS Data Warehouse / Business Intelligence Solution

o With little or no modifications, “Production-alize” existing NoSQL data repositories from the Social Network (which uses Cassandra and/or a JSON Document Store), and then somehow integrate that data with the above planned DW / BI for integrated analytics combining students liking major-combinations with other analytically interesting data (e.g. actual major, academic standing, credits earned, GPA) in the registrar system.

Implementation Goals:

o Assumption: An available (generic) virtualization API (Polybase, Talend, Informatica, etc.) in which we abstract out and then visually map fields between RDBMS and NoSQL data fields in existing structures and, once mapped, can also query and join these mapped data sets simultaneously, for real-time analytics, or as a semantic layer with which to subsequently move data, either way based on business requirements as they unfold.

o No ETL, no new fields in existing tables, and no new RDBMS Tables.
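The implementation goals can be illustrated with a small stand-in for a virtualization layer. This is only a sketch under loud assumptions: it does not use Polybase, Talend, or Informatica; an in-memory SQLite table plays the registrar RDBMS; a Python list plays the NoSQL store; and the names `student`, `actual_major`, `FIELD_MAP`, and `query_integrated`, as well as the sample majors, are made up for illustration.

```python
import sqlite3

# Stand-in for the registrar RDBMS (table and column names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (student_id TEXT PRIMARY KEY, actual_major TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [("1357", "Statistics"), ("2468", "History")])

# Stand-in for like records already held in the NoSQL store.
nosql_likes = [
    {"Student_ID": "1357", "MajorID": "985", "Student_Like_Major_As_Role": "2nd Major"},
    {"Student_ID": "2468", "MajorID": "985", "Student_Like_Major_As_Role": "Minor"},
]

# The model-level mapping: which RDBMS column equals which NoSQL field.
FIELD_MAP = {"rdbms_key": "student_id", "nosql_key": "Student_ID"}

def query_integrated(conn, documents, field_map):
    """Join RDBMS rows to NoSQL records at query time; nothing is written."""
    sql = f"SELECT {field_map['rdbms_key']}, actual_major FROM student"
    actual_major_by_key = dict(conn.execute(sql).fetchall())
    joined = []
    for doc in documents:
        key = doc[field_map["nosql_key"]]
        if key in actual_major_by_key:
            joined.append({
                "student_id": key,
                "actual_major": actual_major_by_key[key],
                "liked_major_id": doc["MajorID"],
                "liked_as": doc["Student_Like_Major_As_Role"],
            })
    return joined

result = query_integrated(conn, nosql_likes, FIELD_MAP)
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
```

The join happens only in the query result, so the RDBMS still has its single original table afterward, in keeping with the "no ETL, no new tables" goal.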


A model diagram should communicate: (a) abstraction levels; (b) representation; (c) referential integrity; (d) API

What is wrong with the model to the left?

Reference: “Concept and Object Modeling Notation” (COMN), by Ted Hills

o Only the representation lines used here follow COMN style.

o Potential extension to UML-modeling (not ER) notation. Not adopted by a leading modeling tool.

o For details on COMN, see: http://www.tewdur.com/index.php


Data modeling now depicts referential integrity and direct representation across abstraction levels: Meaningful and useful.


Comparison of RDBMS DW Data Models in Integration Scenarios:

o Dimensional / Star Schema (Kimball)

o NoSQL offers no existing fact table, nor does it anywhere use the surrogate keys (Dim_Student_ID, Dim_Program_ID)

o Existing Star Schema offers no fact table relevant to Students Liking Majors

o Our intent is no new RDBMS tables, so…

So, we abandon the Star Schema as our integration stage since it does not meet our requirement.


o Third Normal Form DW (Inmon): Abstraction Levels

API makes NoSQL level viewable with RDBMS


Third Normal RDBMS and NoSQL: Detailed Data Model

What used to approximate an SVOT now looks like a raw landing zone. Consider an alternative to 3NF EDW to avoid mixing up the two.


Lean Data Warehouse: High Level Abstraction Levels: JSON Document Integration


Lean Data Warehouse: High Level Abstraction Levels: Cassandra Integration


Lean Data Warehouse: Detailed Data Model: Cassandra Integration

Lean DW and Data Vault defer SVOT attempts to downstream data presentation, instead loosely coupling source data.

Does a true data scientist need a pristine data presentation area for querying? You choose… [ Yes / No ]

If you have to design, script and ETL new RDBMS tables for each new integration, can you keep up with demand? [ Yes / No ]


Recommendations:

1. Differentiate short-lived vs. long-lived NoSQL data structures: For integration with RDBMS, prefer long-lived, reasonably modeled NoSQL data sets.

2. Criteria for good ‘Model-Level Integration’ of RDBMS and NoSQL:

o Keep it simple. Model for simple ‘one to many’ or ‘one to one’ equi-join relationships.

o Excessive model-level data transformation is the killer of transparency in an integration data model.

3. For NoSQL integration target files / documents / tables, insist on the equivalence of…

o 1st Normal Form: In every record, each cell holds only one value. Higher normal forms are better still.

o Identifier fields (e.g. integers), as key candidates, exist and correspond to each in-scope ‘name’ attribute.

o Clearly distinguish the data warehouse from the data presentation layer (e.g. Star Schema), and don’t over-burden the DW itself with an analytic-requirements-driven, highly-transformed (brittle) SVOT attempt.

o Save SVOT transforms and other business rules for downstream ETL into data presentation.

o Strive for generic, loosely-coupled integration without ETL at the EDW Level.

4. Minimize data movement between RDBMS and NoSQL in order to simplify integration and reduce overhead cost.

5. Design loosely-coupled Lean Data Warehouses, rather than tightly-dependent data warehouse / mart as all-at-once attempts at the elusive SVOT, thus drawing a sharp distinction between where the lobster is caught and cooked from where it is served with wine and song to your valued customers.
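The two structural criteria in recommendation 3 (first-normal-form equivalence, and an identifier field for each in-scope ‘name’ attribute) can be checked mechanically. The following sketch assumes NoSQL records are available as Python dictionaries; the helper names `is_first_normal_form` and `names_missing_ids`, the `NAME_TO_ID` mapping, and the field `Counselor_ID` are hypothetical.

```python
def is_first_normal_form(record):
    """True if every cell holds a single value (no lists, dicts, sets, tuples)."""
    return all(not isinstance(v, (list, dict, set, tuple)) for v in record.values())

def names_missing_ids(record, name_to_id):
    """Return the 'name' fields in the record whose identifier field is absent."""
    return [name for name, id_field in name_to_id.items()
            if name in record and id_field not in record]

# Hypothetical pairing of 'name' attributes with their identifier fields.
NAME_TO_ID = {"Student_Name": "Student_ID", "Counselor": "Counselor_ID"}

student_like = {"Student_ID": "1357", "Student_Name": "Hannah Shelby",
                "Date_Liked": "2015_0804"}
counselor = {"Counselor": "Ms. Jenny Davis, M.Ed",
             "Counselor_Specialty_Name": "Career Prep"}
nested_major = {"MajorID": "985",
                "Student_Likes_Major": {"Student_1_Likes_Major": student_like}}

student_ok = (is_first_normal_form(student_like)
              and not names_missing_ids(student_like, NAME_TO_ID))
counselor_gaps = names_missing_ids(counselor, NAME_TO_ID)
nested_is_1nf = is_first_normal_form(nested_major)
```

Here the flat student record passes both checks, the counselor record lacks an identifier for its name, and the nested document would need flattening before it meets the first-normal-form criterion.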


DecisionLab.Net

Services:

o Data Warehouse / Business Intelligence Envisioning, Assessment, and Roadmap

o Expert DW-BI Staff Augmentation: Data Warehouse / Mart / Analytics Architecture, Requirements, Models and Development

Slides available now at… slideshare.net/DanielUpton/

Daniel Upton [email protected] Carlsbad, CA blog: http://www.decisionlab.net phone 760.525.3268