DecisionLab.Net
business intelligence is business performance
DecisionLab http://www.decisionlab.net [email protected] direct 760.525.3268 Carlsbad, California, USA
Data Modeling for
Integration of NoSQL
with a Data Warehouse
by daniel upton
Data Modeling for the Integration
of NoSQL with a Data Warehouse:
SQL Saturday #449, San Diego, Sept 19, 2015
by daniel upton
data warehouse developer / modeler / architect
certified scrum master
DecisionLab.Net business intelligence is business performance
[email protected] blog: www.decisionlab.net connect: www.linkedin.com/in/DanielUpton
Opening Questions…
o Why model data?
o What role does visualization play in a data model?
o Why integrate RDBMS data warehouse with NoSQL data?
o What do I mean by “integrate data”?
o Why model data for an integration between an RDBMS Data Warehouse and NoSQL?
o Should data ever be moved between a Data Warehouse and NoSQL? If so, which way?
o Regardless of a decision to move data or not, at what stage in an RDBMS DW environment should we integrate with NoSQL?
o Staging, EDW, Star Schema, Extracts?
o How do Lean and Agile thinking influence our choice between these stages or methods?
o What does a useful model for data integration between NoSQL and RDBMS communicate?
o How well do various Data Modelling methods support integration with NoSQL?
o Does a data scientist need a star schema, or any single version of the truth structure, to obtain needed answers?
o What are some practical guidelines for when we actually need to accomplish this?
Why do data modeling?
Q: Why do data modeling?
A: Process of learning about and defining a data structure, either in its current or desired state, in an information system.
o Once completed, used to…
o Automatically instantiate a modeled data structure into an information system
o Communicate a complex data structure among people to validate it and increase shared understanding
What role does visualization play?
Visualization is vital for communication: physical proximity, vertical position, relationship lines, colors.
o Without visualization aids, a diagram goes from being a communication tool to a communication obstacle.
Why integrate a Data Warehouse with NoSQL?
o Business requirements: High-value analytics and business intelligence often require the integration of data across disparate data sources.
The industry is not talking much about it yet:
o No generally accepted methods
o Big technical skills gap between RDBMS and NoSQL
o Immaturity of methods and tools for both NoSQL modeling and for its integration with RDBMS
o Default assumption: Integration will always require Extraction, Transformation, Loading, and is therefore a major project.
What exactly do I mean by “data integration”?
o Identify and “instantiate” joins at specific granularities between different data sets according to specific common topics in both sets – customer, click, like, product, purchase, inventory, shipment – to exploit now and later.
o Actual data movement and ETL are optional.
Why model for Integration of RDBMS and NoSQL?
o Very effective process for defining, visualizing, validating and communicating even more complex data structures in current or desired state.
Quick Tips: For good ‘Model-Level Integration’ of RDBMS and NoSQL…
o Keep it simple and source-facing. Avoid complex data transformation.
o Model for simple equi-join relationships: ‘one to many’ or ‘one to one’.
Future Modeling Technology:
o Forward- and reverse-engineering to / from a combined, integrated RDBMS and NoSQL information system.
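A minimal Python sketch of a 'one to many' equi-join performed at read time, with no data moved or transformed beforehand. The field names (Customer_ID, Region, Page) and the sample records are hypothetical; plain lists of dictionaries stand in for rows returned by an RDBMS and for documents returned by a NoSQL store.

```python
from collections import defaultdict

# Hypothetical rows, as they might be read from an RDBMS customer table.
rdbms_customers = [
    {"Customer_ID": "C1", "Region": "West"},
    {"Customer_ID": "C2", "Region": "East"},
]

# Hypothetical documents, as they might be read from a NoSQL click store.
nosql_clicks = [
    {"Customer_ID": "C1", "Page": "/home"},
    {"Customer_ID": "C1", "Page": "/cart"},
    {"Customer_ID": "C9", "Page": "/home"},  # no matching customer row
]


def equi_join(one_side, many_side, key):
    """Join two lists of dicts where the key values are equal (one to many)."""
    # Index the 'many' side by key so each 'one' row finds its matches quickly.
    index = defaultdict(list)
    for record in many_side:
        index[record[key]].append(record)

    joined = []
    for row in one_side:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


joined_rows = equi_join(rdbms_customers, nosql_clicks, "Customer_ID")
for row in joined_rows:
    print(row)
```

Both inputs are left untouched. The join exists only in the query, which is the sense in which data movement and ETL stay optional.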
Should data ever be moved between a Data Warehouse and NoSQL? If so, which way?
Real world: Data will be moved in all directions… in ways good, bad and ugly!
Who thinks RDBMS and NoSQL integration should look like this? How likely is it?
NoSQL purists think it should look like the opposite, with the DW merely as a data source for NoSQL.
Specific methods from the major platform vendors are either high-level or proprietary. No common practice is yet accepted for integration between RDBMS and NoSQL. It’s the Wild West!
High-level data flow in nearly all directions
Proprietary (Polybase with PDW / MS Analytics Platform)
What about integration without data movement… without ETL?
Tip: Between the DW and NoSQL, avoid data movement just for the sake of integration.
o The goal of data integration is not, by itself, a sufficient justification to either move or substantially transform the data, because of the additional overhead that such movement and transformation require.
Regardless of a decision to move or not move the data, at what stage in a Data Warehouse environment should we integrate with NoSQL? Staging, EDW, Star Schema, Extracts?
o Staging: Increments of data, lacking enforceable referential integrity, so inherently non-integrated, thus offering low integration potential.
Enterprise Data Warehouse (Inmon): Entity-Relational Model
o ~3rd Normal Form, Date-Stamped Composite Primary Keys, No Surrogates
o Strategy to enforce a single version of the truth (SVOT), so each characteristic (attribute) of something (entity) exists in just one field and one table, with each instance as one record.
o Inherent, intentional rigid interdependence between classic 3NF tables, based on foreign key constraints
o Pristine data structure is often too rigidly normalized for model-level integration with NoSQL structures that play by different rules.
o Lean / Agile Score: Low. Rigid table structure with strong functional dependencies and specific cardinality baked into design. SVOT design requires data transformation from other sources to comply.
Dimensional / Star Schema: Either as standalone DW Bus (Kimball), or downstream from EDW as data presentation layer.
o Intent is to present a SVOT for pre-defined analyses baked into star schema
o Rigid functional dependence between tables
o Descriptive data is now in a de-normalized dimension table with foreign key relationships only to fact tables containing quantitative fields.
o Lean / Agile Score: Even lower. Even more rigid structure, with added surrogate keys wherein dimensions relate only to existing RDBMS fact tables for pre-defined analyses. Unique IDs such as Department_Code_14 become non-unique (denormalized), thus weaker for new integrations.
o NoSQL Integration requires new Star Schema tables.
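One way to picture the non-uniqueness point is the short Python sketch below. It assumes a hypothetical denormalized dimension (Dim_Employee) in which department attributes are repeated on every employee row. The table name, field names and values are invented for illustration and are not taken from the slides' diagrams.

```python
from collections import Counter

# Hypothetical denormalized dimension: department attributes repeat on each
# employee row, and only the surrogate key Dim_Employee_ID is unique.
dim_employee = [
    {"Dim_Employee_ID": 1, "Employee_Name": "A. Rivera",
     "Department_Code": "14", "Department_Name": "Admissions"},
    {"Dim_Employee_ID": 2, "Employee_Name": "B. Chen",
     "Department_Code": "14", "Department_Name": "Admissions"},
    {"Dim_Employee_ID": 3, "Employee_Name": "C. Okafor",
     "Department_Code": "22", "Department_Name": "Registrar"},
]

# Count how many dimension rows carry each department code.
code_counts = Counter(row["Department_Code"] for row in dim_employee)

# Codes that appear on more than one row can no longer act as a unique key.
non_unique_codes = sorted(code for code, n in code_counts.items() if n > 1)
print(non_unique_codes)
```

A NoSQL record keyed on department code "14" would match two dimension rows here, so a simple one-to-one or one-to-many equi-join on that code is no longer reliable.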
Data Vault Method:
o Summary of Hubs, Satellites, Links, Ensembles (Linstedt, Hultgren, Graziano).
o Align data records, via their business keys, across tables and across systems.
o Track changes to source data records while maintaining or enhancing actual referential integrity between related tables.
o To defer the following: (a) the renaming of source attributes per DW naming standards; (b) the selection of desired fields and records to present for reporting; and (c) any application of subjective business rules or an SVOT attempt, until immediately downstream of the model, in a Star Schema or Semantic Layer.
…and…
o To use a pattern-based separation of keys, attributes, and relationships to accomplish the above while remaining transparently equivalent and auditable to source data.
o Lean / Agile Score: High. Each ensemble stands alone. Hubs, the sole integration point to other ensembles, have zero functional dependencies. Relationship cardinality between ensembles becomes an association, accepting any cardinality based on actual data, not pre-defined business rules. New data subject areas (ensembles) are easily added and introduce zero new functional dependencies on existing structure.
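The separation of keys, attributes and relationships can be sketched in Python as three small record types. This is a simplified illustration, not the full Data Vault standard: real implementations typically also carry load metadata and surrogate or hash keys on every table. All class names, field names and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hub:
    """Holds only a business key and where it came from."""
    business_key: str
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Holds descriptive attributes for one hub key as of a load date."""
    hub_business_key: str
    load_date: date
    attributes: tuple  # (source_field_name, value) pairs, names kept as in source


@dataclass(frozen=True)
class Link:
    """Associates hub keys; imposes no cardinality rule of its own."""
    hub_keys: tuple
    record_source: str


student = Hub("1357", "registrar")

# Two satellite rows for the same key: history is appended, never overwritten.
satellites = [
    Satellite("1357", date(2015, 8, 1), (("Credits_Earned", 30),)),
    Satellite("1357", date(2015, 9, 1), (("Credits_Earned", 45),)),
]

# The same student linked to two majors: the link accepts whatever the data says.
links = [
    Link(("1357", "985"), "social_survey"),
    Link(("1357", "412"), "social_survey"),
]


def current_attributes(rows, business_key):
    """Return the attributes from the most recently loaded satellite row."""
    matching = [s for s in rows if s.hub_business_key == business_key]
    return dict(max(matching, key=lambda s: s.load_date).attributes)


current = current_attributes(satellites, student.business_key)
liked_majors = [l.hub_keys[1] for l in links if l.hub_keys[0] == student.business_key]
print(current, liked_majors)
```

Adding a new subject area in this sketch means adding new hubs, satellites and links. Nothing already present has to change, which is the loose coupling the Lean / Agile score refers to.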
For more insights into the Lean Data Warehouse and Data Vault concepts, see…
www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault
NoSQL Data Models: Cassandra Example…
What To Expect in a NoSQL Model…
If modelled…
o De-normalization,
o Pivoted data,
Maybe no model at all
o Some records with a different number of fields than adjacent records.
o Super-columns and sub-columns,
o Some (not all) of these items may prevent meaningful integration-modeling between NoSQL and RDBMS.
o Even more different from an RDBMS, the hierarchic data model of a document-based NoSQL store involves nested attributes (with or without unique identifiers).
o Example: JavaScript Object Notation (JSON) Document: Student Likes Major (same content)
{
  "MajorID": "985",                        -- Top-level (parent) object with ID (Business Key)
  "MajorName": "Data Science",
  "Student_Likes_Major": {                 -- Nested (child) object with IDs (Business Key for reliable equi-joins on Student_ID)
    "Student_1_Likes_Major": {
      "Student_ID": "1357",
      "Student_Name": "Hannah Shelby",
      "Student_Like_Major_As_Role": "2nd Major",
      "Date_Liked": "2015_0804",
      "Student_Like_NumDays_After_Survey_Posted_Social": "4"
    },
    "Student_2_Likes_Major": {
      "Student_ID": "2468",
      "Student_Name": "David Bookman",
      "Student_Like_Major_As_Role": "Minor",
      "Date_Liked": "2015_0801",
      "Student_Like_NumDays_After_Survey_Posted_Social": "1"
    },
    …
    "Student_N_Likes_Major": …
  },
  "Major_Academic_Counselor_Current": [    -- Nested Array (no IDs; no reliable equi-joins)
    {
      "Counselor": "Ms. Jenny Davis, M.Ed",
      "Counselor_Specialty_Name": "Career Prep"
    }
  ]
}
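As a rough Python sketch, the nested child objects of such a document can be read as flat rows that each carry both business keys (MajorID and Student_ID), which is what makes an equi-join possible. The document text below is a trimmed, valid-JSON version of the slide's example with the annotations and elided students left out. The function name flatten_likes is hypothetical.

```python
import json

# Trimmed version of the slide's document: two students, no annotations.
document_text = """
{
  "MajorID": "985",
  "MajorName": "Data Science",
  "Student_Likes_Major": {
    "Student_1_Likes_Major": {
      "Student_ID": "1357",
      "Student_Like_Major_As_Role": "2nd Major",
      "Date_Liked": "2015_0804"
    },
    "Student_2_Likes_Major": {
      "Student_ID": "2468",
      "Student_Like_Major_As_Role": "Minor",
      "Date_Liked": "2015_0801"
    }
  },
  "Major_Academic_Counselor_Current": [
    {"Counselor": "Ms. Jenny Davis, M.Ed", "Counselor_Specialty_Name": "Career Prep"}
  ]
}
"""


def flatten_likes(doc):
    """Turn each nested 'like' object into one flat row keyed by both IDs."""
    rows = []
    for child in doc["Student_Likes_Major"].values():
        row = {"MajorID": doc["MajorID"]}  # carry the parent key down to the child
        row.update(child)
        rows.append(row)
    return rows


major_doc = json.loads(document_text)
like_rows = flatten_likes(major_doc)
for row in like_rows:
    print(row)
```

The counselor array is deliberately left alone: its entries have no identifier field, so there is nothing dependable to join on.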
Business Scenario:
o In a Social Network Survey, a (one) university student Likes multiple combinations of 1st Majors, 2nd Majors, and Minors, but the University has not officially allowed them, nor do core OLTP systems support them.
o Registrar OLTP and legacy 3NF EDW use a business rule that only allows Many Students [enrolled in] One Major.
Objectives:
o Build a new RDBMS Data Warehouse / Business Intelligence Solution
o With little or no modification, “Production-alize” existing NoSQL data repositories from the Social Network (which uses Cassandra and/or a JSON Document Store), and then somehow integrate that data with the above planned DW / BI for integrated analytics combining students liking major-combinations with other analytically interesting data (e.g. actual major, academic standing, credits earned, GPA) in the registrar system.
Implementation Goals:
o Assumption: An available (generic) virtualization API (Polybase, Talend, Informatica, etc.) in which we abstract out and then visually map fields between RDBMS and NoSQL data fields in existing structures and, once mapped, can also query and join these mapped data sets simultaneously, for real-time analytics, or as a semantic layer with which to subsequently move data, either way based on business requirements as they unfold.
o No ETL, no new fields in existing tables, and no new RDBMS tables.
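The field-mapping idea can be approximated in a few lines of Python. This is only a stand-in for a virtualization layer, not how Polybase, Talend or Informatica work. An in-memory SQLite table plays the part of an existing registrar table (it is created here purely so the sketch runs), a list of dictionaries plays the NoSQL side, and every table name, column name and value is hypothetical.

```python
import sqlite3

# Stand-in for an EXISTING registrar table; created only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE registrar_student "
    "(student_no TEXT PRIMARY KEY, enrolled_major TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO registrar_student VALUES (?, ?, ?)",
    [("1357", "Statistics", 3.7), ("2468", "Data Science", 3.2)],
)

# Stand-in for 'like' records read from the NoSQL store.
nosql_likes = [
    {"Student_ID": "1357", "MajorID": "985", "Student_Like_Major_As_Role": "2nd Major"},
    {"Student_ID": "2468", "MajorID": "985", "Student_Like_Major_As_Role": "Minor"},
]

# The mapping: one logical key, named differently on each side.
FIELD_MAP = {"student_key": ("student_no", "Student_ID")}


def likes_with_registrar(connection, likes):
    """Join NoSQL likes to registrar rows at query time via the field map."""
    rdbms_field, nosql_field = FIELD_MAP["student_key"]
    by_key = {
        row[rdbms_field]: dict(row)
        for row in connection.execute("SELECT * FROM registrar_student")
    }
    combined = []
    for like in likes:
        student = by_key.get(like[nosql_field])
        if student is not None:
            combined.append({**student, **like})
    return combined


combined = likes_with_registrar(conn, nosql_likes)
for row in combined:
    print(row)
```

Nothing is written back to either side: no new fields, no new tables, and the combined rows exist only for the duration of the query.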
A model diagram should communicate: (a) abstraction levels; (b) representation; (c) referential integrity; (d) API.
What is wrong with the model to the left?
Reference: “Concept and Object Modeling Notation” (COMN), by Ted Hills
o Only the representation lines used here follow COMN style.
o Potential extension to UML-modeling (not ER) notation. Not adopted by a leading modeling tool.
o For details on COMN, see: http://www.tewdur.com/index.php
Data modeling now depicts referential integrity and direct representation across abstraction levels: Meaningful and useful.
Comparison of RDBMS DW Data Models in Integration Scenarios:
o Dimensional / Star Schema (Kimball)
o NoSQL offers no existing fact table, nor does it anywhere use the surrogate keys (Dim_Student_ID, Dim_Program_ID)
o Existing Star Schema offers no fact table relevant to Students Liking Majors
o Our intent is no new RDBMS tables, so…
So, we abandon the Star Schema as our integration stage since it does not meet our requirement.
o Third Normal Form DW (Inmon): Abstraction Levels
API makes NoSQL level viewable with RDBMS
Third Normal RDBMS and NoSQL: Detailed Data Model
What used to approximate an SVOT now looks like a raw landing zone. Consider an alternative to 3NF EDW to avoid mixing up the two.
Lean Data Warehouse: High Level Abstraction Levels: JSON Document Integration
Lean Data Warehouse: High Level Abstraction Levels: Cassandra Integration
Lean Data Warehouse: Detailed Data Model: Cassandra Integration
Lean DW and Data Vault defer SVOT attempts to downstream data presentation, instead loosely coupling source data.
Does a true data scientist need a pristine data presentation area for querying? You choose… [ Yes / No ]
If you have to design, script and ETL new RDBMS tables for each new integration, can you keep up with demand? [ Yes / No ]
Recommendations:
1. Differentiate short-lived vs. long-lived NoSQL data structures: For integration with RDBMS, prefer long-lived, reasonably modeled NoSQL data sets.
2. Criteria for good ‘Model-Level Integration’ of RDBMS and NoSQL:
o Keep it simple. Model for simple ‘one to many’ or ‘one to one’ equi-join relationships.
o Excessive model-level data transformation is the killer of transparency in an integration data model.
3. For NoSQL integration target files / documents / tables, insist on the equivalence of…
o 1st Normal Form: In every record, each cell holds only one value. Higher normal forms are better still.
o Identifier fields (e.g. integers), as key candidates, exist and correspond to each in-scope ‘name’ attribute.
o Clearly distinguish data warehouse from data presentation layer (e.g. Star Schema), and don’t over-burden the DW itself with an analytic-requirements-driven, highly-transformed (brittle) SVOT attempt.
o Save SVOT transforms and other business rules for downstream ETL into data presentation.
o Strive for generic, loosely-coupled integration without ETL at the EDW Level.
4. Minimize data movement between RDBMS and NoSQL in order to simplify integration and reduce overhead cost.
5. Design loosely-coupled Lean Data Warehouses, rather than tightly-dependent data warehouse / mart as all-at-once attempts at the elusive SVOT, thus drawing a sharp distinction between where the lobster is caught and cooked from where it is served with wine and song to your valued customers.
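Two of the criteria in recommendation 3 lend themselves to a quick automated check. The Python sketch below tests whether a NoSQL record is 1st-Normal-Form-equivalent (every cell holds a single scalar value) and whether each 'name' attribute has a matching identifier field. The "_Name" / "_ID" suffix convention and both function names are assumptions made for illustration. Real data sets may need a different naming rule.

```python
SCALAR_TYPES = (str, int, float, bool, type(None))


def is_first_normal_form(record):
    """True when every field holds one scalar value (no lists or nested objects)."""
    return all(isinstance(value, SCALAR_TYPES) for value in record.values())


def names_missing_identifiers(record, name_suffix="_Name", id_suffix="_ID"):
    """List 'name' fields that have no companion identifier field.

    Assumes a hypothetical naming rule: X_Name should be paired with X_ID.
    """
    missing = []
    for field in record:
        if field.endswith(name_suffix):
            id_field = field[: -len(name_suffix)] + id_suffix
            if id_field not in record:
                missing.append(field)
    return missing


# A flat 'like' record with an identifier for its name attribute.
like_record = {"Student_ID": "1357", "Student_Name": "Hannah Shelby"}

# A counselor record with a name attribute but no identifier.
counselor_record = {"Counselor": "Ms. Jenny Davis, M.Ed",
                    "Counselor_Specialty_Name": "Career Prep"}

# A record that still contains a nested array.
nested_record = {"MajorID": "985", "Major_Academic_Counselor_Current": [counselor_record]}

print(is_first_normal_form(like_record), names_missing_identifiers(like_record))
print(is_first_normal_form(counselor_record), names_missing_identifiers(counselor_record))
print(is_first_normal_form(nested_record))
```

Records that fail either check are the ones most likely to need reshaping before a simple equi-join can be modeled against them.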
DecisionLab.Net
Services:
Data Warehouse / Business Intelligence Envisioning, Assessment, and Roadmap
Expert DW-BI Staff Augmentation: Data Warehouse / Mart / Analytics Architecture, Requirements, Models and Development
Slides available now at… slideshare.net/DanielUpton/
Daniel Upton [email protected] Carlsbad, CA blog: http://www.decisionlab.net phone 760.525.3268