This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. It is focused on the business intelligence side of the house. If you want to use these slides, please put (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
The Application of Data Vault to DW2.0
© Dan Linstedt, 2011-2012 all rights reserved
A bit about me…
• Author, Inventor, Speaker – and part time photographer…
• 25+ years in the IT industry
• Worked in DoD, US Gov’t, Fortune 50, and so on…
• Find out more about the Data Vault:
  o http://www.youtube.com/LearnDataVault
  o http://LearnDataVault.com
• Full profile on http://www.LinkedIn.com/dlinstedt
04/10/2023 Do Not Duplicate Without Written Permission
Agenda
• Defining The Needs for the Data Vault
  o DW2.0 Architecture
  o DW2.0 Drivers for Data Modeling
  o Divergence of Data Models over Time
• Data Vault in DW2.0
  o Defining the Data Vault
  o What does one look like?
  o Modeling in DW2.0
  o Applying Data Vault to Global DW2.0
  o Applying Data Vault to Time-Value DW2.0
  o Compliance in DW2.0
  o Applying Data Vault to System of Record
• The Paradox of DW2.0
  o Volume, Latency, Complexity, Normalization and Transformation ability
DW2.0 Architecture
[Figure: DW2.0 architecture diagram. Labels: Interactive, Integrated, Near-Line, Archival; METADATA; Tactical, Historical, Strategic, Extended; Enterprise Data Warehouse; Active Data Mining, Transformation, Active Cleansing, Cube Processing, Temporal Indexing, Semantic Management; Enterprise Service Bus.]
ESB Connectivity:
• EAI
• EII
• ETL / ELT
• Web Services
ESB Management:
• Text
• Email
• Spread Sheets
• Transaction
• Structured Information
Unstructured Data:
• Email
• Plain Text
• Word Docs
• Images
Data Models must be consistently applied throughout all layers.
DW2.0 Drivers for Data Modeling
• Data Models are one of the main integration points between Technical and Business drivers.
• Business Keys drive understandability and granularity
• Normalization drives flexibility and frequency of load
• Raw data sets in the EDW/ADW drive compliance and volume
[Figure: two Data Model diagrams. Technical Drivers: Volume, Frequency, Granularity. Business Drivers: Flexibility, Compliance, Understandability.]
Divergence of Data Models over Time
• Data models (both logical and physical) have diverged from business drivers and direction over time.
• The Data Models have driven towards physical improvements instead of towards business improvements.
• The Data Vault Architecture drives data modeling back to the business side of the house.
[Figure: chart of Business Goals against Time, with lines for Standard Data Modeling, Data Vault Modeling, and Business Process Modeling.]
Defining the Data Vault
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.
It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
Source: Defining the Data Vault (TDAN.com Article)
What Does One Look Like?
[Figure: three Hubs (Customer, Account, InvoiceID), each with Satellites (Sat) marked F(x), holding Customer Information, Account Information, and Invoice / Billing Information. A Link, with its own Satellite, joins the Hubs and records a history of the interaction.]
The impact of linking disparate systems together is inside the shaded area.
Elements:
• Hub
• Link
• Satellite
Modeling in DW2.0
• Bill Says:
  o DW2.0 must be brought down to a very finite level of detail.
  o The starting point for DW2.0 is the modeling process.
  o The data model applies to the integrated sector, the near line sector, and the archival sector.
  o The way that data warehouses are built is in an incremental manner.
• The Data Vault specializes in:
  o Providing finite grain at the lowest level possible.
  o Mapping business process models to data models.
  o Existing in all sectors simultaneously without changes.
  o Flexibility and managing change so that impacts are not a mile wide and 10 miles deep.
Elements in a Data Vault
• Hub
  o Unique List of Business Keys, tracked by the first time the warehouse saw them appear.
• Link
  o Relationships between business keys, also representing a grain shift, or a hierarchical roll-up.
• Satellite
  o Data over time, granular, and descriptive about the business key. Also set up according to type of information and rate of change.
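The three elements can be sketched as plain record types. This is a minimal illustration only, not the author's physical design: the class fields, the `register_key` helper, and the use of in-memory dictionaries are all assumptions made for the example. The customer key and record sources are borrowed from later slides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One row per unique business key, stamped with when it was first seen."""
    business_key: str
    first_seen: datetime
    record_source: str


@dataclass(frozen=True)
class Link:
    """A relationship between two or more business keys."""
    business_keys: tuple[str, ...]
    first_seen: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive attributes for one business key as of one load date."""
    business_key: str
    load_date: datetime
    attributes: dict[str, str]
    record_source: str


def register_key(hubs: dict[str, Hub], key: str, seen: datetime, source: str) -> Hub:
    """Hypothetical helper: insert the key only if new; keep the first-seen stamp."""
    return hubs.setdefault(key, Hub(key, seen, source))


hubs: dict[str, Hub] = {}
register_key(hubs, "ABC12356", datetime(2000, 10, 12), "Finance")
register_key(hubs, "ABC12356", datetime(2000, 12, 2), "Contracts")  # already known
print(hubs["ABC12356"].first_seen)
```

The second call changes nothing, which mirrors the Hub's role as a unique list of keys tracked by first appearance.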
Applying the Data Vault to Global DW2.0
[Figure: groups of Hubs, Links, and Satellites (Sat) spread across locations and connected by Links. Labels: Base EDW Created in Corporate; Financials in USA; Manufacturing EDW in China; Planning in Brazil.]
Applying the Data Vault to Time-Value DW2.0
Satellite Data Over Time (four rows, listed here by Load_Date)
Cust_Key | Load_Date | Name | Description | Record Src
1 | 10-12-2000 | Acme Incorporated | Super Ducts | Finance
1 | 10-14-2000 | Acme Corp, Inc | Super Ducts | Finance
1 | 10-31-2000 | Acme Incorporated | Super Ducts | Finance
1 | 12-2-2000 | Acme Inc | Super Ducts | Contracts
Satellite entities in the Data Vault house data over time. They are split by type of information and rate of change. This is an example set of data for a customer name satellite.
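Because a satellite keeps every version of a row, the value in effect at any point in time can be looked up. The sketch below uses the four customer-name rows from the slide. The `name_as_of` function is hypothetical, and reading the dates as month-day-year is an assumption.

```python
from bisect import bisect_right
from datetime import date

# (load_date, name, record_source) rows for Cust_Key 1, taken from the slide.
# Dates are interpreted as month-day-year, which is an assumption.
name_satellite = [
    (date(2000, 10, 12), "Acme Incorporated", "Finance"),
    (date(2000, 10, 14), "Acme Corp, Inc", "Finance"),
    (date(2000, 10, 31), "Acme Incorporated", "Finance"),
    (date(2000, 12, 2), "Acme Inc", "Contracts"),
]


def name_as_of(rows, when):
    """Return the name in effect on `when`, or None if nothing was loaded yet."""
    ordered = sorted(rows)
    load_dates = [row[0] for row in ordered]
    position = bisect_right(load_dates, when)  # rows loaded on or before `when`
    return ordered[position - 1][1] if position else None


print(name_as_of(name_satellite, date(2000, 10, 20)))  # Acme Corp, Inc
print(name_as_of(name_satellite, date(2000, 1, 1)))    # None
```

No row is ever overwritten, so earlier answers stay reproducible after later loads.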
Batch and Real-Time Data Arrival
Sample transaction:
Transaction ID | 1128589388
Date Stamp | 10-12-2000 16:43
Customer | ABC12356
Account # | 1UX2589a
Amount | $10.00
Type | DBT
[Figure: a real-time transaction alongside a Batch Load of Customer Info and Acct Data (3, 6 or 12 Hr Load Window). Target tables: Hub Customer, Link Transaction, Hub Acct, Sat Transaction, Sat Customer, Sat Acct.]
All Inserts, All the time
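One way to read "all inserts, all the time" is that a transaction can be loaded from its business keys alone, before the batch-loaded customer and account details arrive. The sketch below assumes that reading. The in-memory structures and the `load_transaction` function are hypothetical stand-ins for the Hub, Link, and Satellite tables named on the slide.

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for the tables named on the slide.
hub_customer: dict[str, datetime] = {}
hub_account: dict[str, datetime] = {}
link_transaction: dict[tuple[str, str, str], datetime] = {}
sat_transaction: list[dict] = []


def load_transaction(txn: dict, loaded_at: datetime) -> None:
    """Load one transaction using inserts only; no existing row is updated."""
    # Hubs need only the business key, so no descriptive data has to exist yet.
    hub_customer.setdefault(txn["customer"], loaded_at)
    hub_account.setdefault(txn["account"], loaded_at)
    link_key = (txn["transaction_id"], txn["customer"], txn["account"])
    link_transaction.setdefault(link_key, loaded_at)
    # Descriptive values go to the satellite as a new row.
    sat_transaction.append({
        "transaction_id": txn["transaction_id"],
        "load_date": loaded_at,
        "date_stamp": txn["date_stamp"],
        "amount": txn["amount"],
        "type": txn["type"],
    })


sample = {
    "transaction_id": "1128589388",
    "date_stamp": "10-12-2000 16:43",
    "customer": "ABC12356",
    "account": "1UX2589a",
    "amount": "10.00",
    "type": "DBT",
}
load_transaction(sample, loaded_at=datetime(2000, 10, 12, 16, 43))
```

Customer Info and Acct Data arriving hours later would simply add satellite rows against keys that already exist.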
Star Schema Real-Time Data Issues
Sample transaction:
Transaction ID | 1128589388
Date Stamp | 10-12-2000 16:43
Customer | ABC12356
Account # | 1UX2589a
Amount | $10.00
Type | DBT
[Figure: a real-time transaction alongside a Batch Load of Customer Info and Acct Data (3, 6 or 12 Hr Load Window). Target tables: Dimension Customer, Fact Transaction, Dimension Account.]
Updates are REQUIRED!
Cleansing and quality must occur before the data can reach the target tables, and both introduce unwanted latency!
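The slide does not say where the updates come from. One common cause, assumed here, is a fact row arriving before its dimension row: the loader inserts a placeholder dimension row and must later update it when the batch data lands. Everything in this sketch is hypothetical, including the function names and the pairing of customer ABC12356 with the name Acme Incorporated.

```python
# Hypothetical in-memory stand-ins for a star schema; names are illustrative.
dimension_customer: dict[str, dict] = {}
fact_transaction: list[dict] = []


def load_fact(txn: dict) -> None:
    """Load a fact row, creating a placeholder dimension row if the parent is missing."""
    row = dimension_customer.get(txn["customer"])
    if row is None:
        row = {"surrogate_key": len(dimension_customer) + 1, "name": None}
        dimension_customer[txn["customer"]] = row
    fact_transaction.append({"customer_sk": row["surrogate_key"], "amount": txn["amount"]})


def apply_batch_customer(business_key: str, name: str) -> None:
    """Apply batch customer info; an existing placeholder must be updated in place."""
    row = dimension_customer.get(business_key)
    if row is None:
        dimension_customer[business_key] = {
            "surrogate_key": len(dimension_customer) + 1,
            "name": name,
        }
    else:
        row["name"] = name  # the UPDATE that an insert-only load avoids


load_fact({"customer": "ABC12356", "amount": "10.00"})
apply_batch_customer("ABC12356", "Acme Incorporated")  # invented pairing
```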
Compliance in DW2.0
• Raw Detail = auditable
• Loads in Real-Time or in Batch
• Integrated by Business Key
• Flexible, allows business changes (with little to no impact)
• No delay in loading data
• Data type conformity
• Semantic Integration
[Figure: information flow diagram. Stages: Source Systems; EDW / ADW (Data Vault); Data Marts (Data Delivery). Other labels: Raw Integration, Business Rules, Error Mart, True Marts, User or Auditor, Changes to Source Information, Direction of Information Flow, Master Data (Operational), Continuous Data Quality Improvement.]
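The split between raw integration and business rules can be sketched as follows, assuming the business rules run on the way out of the Data Vault toward the marts. The rule, the function names, and the second (malformed) record are invented for illustration.

```python
# Raw rows as loaded into the vault. The second row is made-up bad data.
raw_vault = [
    {"customer": "ABC12356", "amount": "10.00", "type": "DBT"},
    {"customer": "ABC12356", "amount": "ten dollars", "type": "DBT"},
]


def is_valid_amount(row: dict) -> bool:
    """Hypothetical business rule: the amount must parse as a number."""
    try:
        float(row["amount"])
    except ValueError:
        return False
    return True


def deliver(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Copy rows into a true mart or an error mart; the raw rows are left untouched."""
    true_mart, error_mart = [], []
    for row in rows:
        target = true_mart if is_valid_amount(row) else error_mart
        target.append(dict(row))
    return true_mart, error_mart


true_mart, error_mart = deliver(raw_vault)
print(len(raw_vault), len(true_mart), len(error_mart))  # 2 1 1
```

Because the rule is applied downstream, an auditor can still see both raw rows exactly as they arrived.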
Applying the Data Vault to System Of Record
• SOR 1
  o Data Capture, Data Produced by system algorithms
• SOR 2
  o Raw Detailed Integrated Data over time, Integrated by Horizontal (functional) Business Key. Auditable.
• SOR 3
  o Current view of the business, merged, quality cleansed, single copy, single source, feeds operational systems.
[Figure: Source Systems (SOR Definition 1), Normalized EDW (SOR Definition 2), Master Data or Conformed Dimensions (SOR Definition 3).]
DW2.0 Paradoxes
• DW2.0 incorporates:
  o Unstructured, Semi-Structured, Real-Time, and Batch Data
  o Global views
  o All of which drive volumes of data.
• Volume causes latency in transformation.
• Volume is directly proportional to transformation complexity.
• Real-Time data arrival is inversely proportional to complexity and volume.
• Time for “quality, cleansing, and transformation” on the way in to the EDW diminishes as near-real-time is approached, or as massive volumes of batch data are found within a shrinking batch window.
• Transformation can destroy data auditability and compliance of the EDW / ADW.
DW2.0 Paradoxes - Imagery
[Figure: relationship diagram. Nodes: DW2.0, Volume, Real-Time Transactions, Unstructured Data, Low-Level Grain, Low Latency, Merging / Quality / Cleansing, Data Model Denormalization, Data Model Normalization & Raw Details, Auditability & Compliance. Arrow labels: Drives, Increases, Requires, Pushes, Fights, Inhibits, Provides.]
DW2.0 Paradox Hypothesis
• As we reach near-real-time, the ability to transform data and “wait” for parent dependencies decreases and data decay rates increase, which can cause data death if the data is not processed in time.
• Normalization of the data model increases flexibility and scalability.
• The closer we get to near-real-time, the more normalized the data model in the EDW/ADW must become.
• In order to process high volumes of batch data extremely fast, the “business transformations” must be removed from the load stream of the EDW.
Data Vault Volumetrics
Volumetrics (10% null Data)
Table | Initial Size
Cust Addr | 41.15 MB
Cust Company | 22.36 MB
Cust Detail | 10.00 MB
Cust Hub | 8.20 MB
Cust Name | 28.00 MB
Initial Total Size | 109 MB (200k Rows)
Monthly Growth Rate (new customers) | 15% / Month = 16.45 MB
Upon initial investigation, the 12-month growth rate for new customers is 197.4 MB per year…
Now let’s factor in the deltas.
Data Vault Growth
Volumetrics (10% null Data) – Delta Growth Only
Table | Initial Size (Data & Indexes) | Avg Growth Per Week | Avg Growth Per Month | Avg Growth Per Year
Cust Addr | 41.15 MB | 5% = 2.0 MB | 0% | 104 MB
Cust Company | 22.36 MB | 0% | 0% | 10% = 2.23 MB
Cust Detail | 10.00 MB | 10% = 1.0 MB | varies | 12 MB
Cust Hub | 8.20 MB | 0% | 0% | 0%
Cust Name | 28.00 MB | 0% | 0% | 5% = 1.4 MB
Initial Total Size | 109 MB (200k Rows) | 1.0 MB | - | 119.63 MB
Growth Rate | 15% / Month (16.45 MB) | - | - | 197.40 MB
TOTAL GROWTH / YEAR | | - | - | 317.03 MB
Original Dimension: 497.16 MB per Year
New Data Vault: 317.03 MB per Year
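The yearly totals on this slide can be checked with a few lines of arithmetic. The figures are the slide's own; the variable names are illustrative, and `Decimal` is used only to keep the two-decimal values exact.

```python
from decimal import Decimal

# Average delta growth per year, in MB, as listed on the slide.
yearly_delta_growth_mb = {
    "Cust Addr": Decimal("104"),
    "Cust Company": Decimal("2.23"),
    "Cust Detail": Decimal("12"),
    "Cust Hub": Decimal("0"),
    "Cust Name": Decimal("1.4"),
}
# New-customer growth: 15% per month, stated on the slide as 16.45 MB.
monthly_new_customer_growth_mb = Decimal("16.45")

delta_total = sum(yearly_delta_growth_mb.values())
new_customer_total = monthly_new_customer_growth_mb * 12
total_growth = delta_total + new_customer_total

print(delta_total)         # 119.63
print(new_customer_total)  # 197.40
print(total_growth)        # 317.03
```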
Data Vault VS Dimension Growth
[Figure: line chart comparing Dimension and Data Vault size from Initial Size through Year 5; y-axis labelled Gigabytes, 0 to 3000.]
 | Initial Size | Year 1 | Year 2 | Year 3 | Year 4 | Year 5
Dimension | 114 | 611.16 | 1108.32 | 1605.48 | 2102.64 | 2599.8
Data Vault | 109 | 426.03 | 742.06 | 1059.09 | 1376.12 | 1693.15
How does the extensive growth rate affect queries?
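The table appears to follow a straight-line projection: initial size plus a fixed yearly growth (497.16 for the Dimension, 317.03 for the Data Vault). The sketch below assumes that rule; the `project` function is hypothetical. Under this assumption the Dimension row reproduces exactly, while the Data Vault figures from Year 2 onward come out 1.00 higher than the slide's (743.06 versus 742.06 in Year 2), so the slide may use a slightly different rule or carry a small arithmetic slip.

```python
from decimal import Decimal


def project(initial, yearly_growth, years=5):
    """Return sizes for the initial state and each following year (linear growth)."""
    return [initial + yearly_growth * year for year in range(years + 1)]


dimension = project(Decimal("114"), Decimal("497.16"))
data_vault = project(Decimal("109"), Decimal("317.03"))

for label, series in (("Dimension", dimension), ("Data Vault", data_vault)):
    print(label, [str(value) for value in series])
```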
Summarization
Business:
• Lack of a single view of a customer, product, service, etc...
• Lack of visibility into ALL information across the enterprise.
• Competition does it better, faster, cheaper.
• Unable to identify and forecast business trends and their impacts.
• WHERE’S THE KNOWLEDGE? OR IS IT JUST ALL DATA?
Technical:
• Near-Real-Time (Active)
• Huge Data Volumes
• Massive Data Dis-Integration
• Spread-Marts
• Convergence of Operational and Strategic Questions
• Duplication of data in the ODS, Warehouse, and Data Marts!
• Dimension-itis!!
• ODS Ulcer!
• Fact Table Granularity
• JUNK tables, Helper Tables
Where To Learn More
• The Technical Modeling Book: http://LearnDataVault.com
• The Discussion Forums & events: http://LinkedIn.com – Data Vault Discussions
• Contact me:
  o http://DanLinstedt.com - web
  o [email protected] - email
• World wide User Group (Free): http://dvusergroup.com