A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
REVIEW QUESTIONS
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB [email protected]
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
As a 100% Daimler subsidiary, we give
100 percent, always and never less.
We love IT and pull out all the stops to
aid Daimler's development with our
expertise on its journey into the future.
Our objective: We make Daimler the
most innovative and digital mobility
company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision making process,
ability to react quickly)
Daimler TSS
LOCATIONS
Data Warehouse / DHBW
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
• Distributed data: data is spread across applications
• Different data structures: each system has its own data model
• Missing integrated view: data is not harmonized / standardized
• Missing historic data: OLTP normally stores current data only
• Technological challenges: OLTP has different infrastructure requirements
• System workload: additional workload from DWH users most likely decreases performance for OLTP users
WHICH CHALLENGES COULD NOT BE SOLVED BY OLTP? WHY IS A DWH NECESSARY?
Data Warehouse / DHBW | Daimler TSS
EXPLAIN TWO DEFINITIONS OF THE DWH
Ralph Kimball: “A data warehouse is a copy of transaction data specifically structured for querying and reporting”
William Harvey “Bill” Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process”
• Subject-oriented
  • A DWH is organized around the major themes, not around processes
• Integrated
  • Data in the DWH is harmonized and uniformly standardized across all sources
• Non-volatile
  • Operations in a DWH are insert and select (updates and deletes for technical reasons only)
• Time-variant
  • All data in the DWH is accurate as of some moment in time and has to be associated with a timestamp
WHICH CHARACTERISTICS DOES A DWH HAVE ACCORDING TO BILL INMON?
• Staging (Input)
• Integration (Cleansing)
• Core Warehouse (Storage)
• Aggregation
• Mart (Reporting, Output)
• and additionally Metadata, Security, DWH Manager, Monitor
WHICH LAYERS DOES THE LOGICAL STANDARD ARCHITECTURE HAVE?
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
[Figure: logical standard DWH architecture. Internal and external data sources (OLTP) feed the Data Warehouse, which consists of the Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer and Mart Layer (Output Layer, Reporting Layer), divided into backend and frontend, with Metadata Management, Security and DWH Manager incl. Monitor as additional components.]
• “Landing Zone” for data coming into a DWH
• Purpose is to speed up loading into the DWH and to decouple source and target systems (e.g. repeated extraction runs, additional deliveries)
• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)
• Usually not persistent, therefore regular housekeeping is necessary (for instance, delete data in this layer that is a few days/weeks old or, more commonly, once a correct upload to the Core Warehouse Layer is ensured)
• Tables have no referential integrity constraints, columns often varchar only
DESCRIBE STAGING LAYER CHARACTERISTICS
• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model
• Contains enterprise-wide data organized around multiple subject-areas
• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)
• Size of database can be several TB and can grow rapidly due to data historization
DESCRIBE CORE WAREHOUSE LAYER CHARACTERISTICS
• Data is stored in a denormalized data model for performance reasons and better end user usability/understanding
• The Data Mart Layer typically provides aggregated data or data with less history (e.g. latest years only) in a denormalized data model
• Created through filtering or aggregating the Core Warehouse Layer
• One Mart ideally represents one subject area
• Technically, the Data Mart Layer can also be part of an analytical frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not be stored in a relational database
DESCRIBE MART LAYER CHARACTERISTICS
• 2-layered architecture
• The sum of the data marts constitutes the Enterprise DWH
• Enterprise Service Bus / conformed dimensions for integration purposes
(don’t confuse with ESB as middleware/communication system between applications)
• Rather simple approach to make data fast and easily accessible
• Lower startup costs (but higher subsequent development costs)
• If table structures change (unstable source systems), the effort to implement the changes and reload data is high, especially for conformed dimensions (“Dimensionitis” disease)
KIMBALL BUS ARCHITECTURE
WHAT ARE MAIN CHARACTERISTICS?
[Figure label: Core Warehouse Layer]
DATA INTEGRATION WITH AND WITHOUT CORE WAREHOUSE LAYER
• Data in Raw Data Vault Layer is regarded as “Single version of the facts”
• Business rules are implemented down-stream
• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only”
• Real-Time capability
• Write-back of data in Core Warehouse Layer
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
WHAT ARE MAIN CHARACTERISTICS?
WHICH TABLE TYPES ARE USED IN DATA VAULT?
HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?
Staging data in table stg_vehicle from 15.01.2015
vehicleid | model  | productiondate | engine | color
V1        | SUV    | 15.01.13       | E1     | red
V2        | Cabrio | 16.01.13       | E2     | blue
V1        | SUV    | 15.01.13       | E1     | red
V3        | Cabrio | 17.01.13       | E3     | red

Staging data in table stg_vehicle from 16.01.2015
vehicleid | model  | productiondate | engine | color
V1        | SUV    | 16.01.13       | E4     | red
V4        | Cabrio | 17.01.13       | E5     | blue

Staging data in table stg_vehicle from 17.01.2015
vehicleid | model  | productiondate | engine | color
V1        | SUV    | 16.01.13       | E1     | red
• H_VEHICLE
• 4 rows: V1, V2, V3, V4
• H_ENGINE
• 5 rows: E1, E2, E3, E4, E5
• L_PLUGGED_IN_EFFECTIVITY
• 5 rows: V1-E1, V2-E2, V3-E3, V1-E4, V4-E5
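The row counts above can be reproduced with a minimal sketch, assuming that hubs and links are insert-only and that a business key (or key combination) is stored only the first time it is seen. The list names mirror the table names on the slide; the helper `insert_new` is hypothetical and only serves the illustration.

```python
# Staging loads from the slide, reduced to the business keys (vehicleid, engine).
loads = [
    [("V1", "E1"), ("V2", "E2"), ("V1", "E1"), ("V3", "E3")],  # load of 15.01.2015
    [("V1", "E4"), ("V4", "E5")],                              # load of 16.01.2015
    [("V1", "E1")],                                            # load of 17.01.2015
]


def insert_new(table, key):
    """Append the key only if it is not stored yet (insert-only loading)."""
    if key not in table:
        table.append(key)


h_vehicle, h_engine, l_plugged_in = [], [], []
for load in loads:
    for vehicle_id, engine_id in load:
        insert_new(h_vehicle, vehicle_id)                   # hub: one row per vehicle key
        insert_new(h_engine, engine_id)                     # hub: one row per engine key
        insert_new(l_plugged_in, (vehicle_id, engine_id))   # link: one row per key pair

print(len(h_vehicle), h_vehicle)
print(len(h_engine), h_engine)
print(len(l_plugged_in), l_plugged_in)
```

The third load delivers V1-E1 again; since that combination already exists in the link, no new row is added.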
HOW MANY ROWS ARE STORED IN THE FIRST 3 SAT TABLES?
[Figure: Data Vault model annotated with the row counts 5, 4, 6, 4, 5, 5]
WHICH TABLE TYPES ARE USED IN A STAR SCHEMA?
SCD Type 1
• No History
• Dimension attributes always contain current data
SCD Type 2
• Full Historization
• Dimension contains timestamps
SCD Type 3
• Historization of latest change only
• And storage of current value
WHICH THREE TYPES OF SLOWLY CHANGING DIMENSIONS ARE MOST COMMON?
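The difference between the three types can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the dimension is held in plain dictionaries and lists, and the function and column names (`scd1`, `scd2`, `scd3`, `valid_from`, `valid_to`, `current`, `previous`) are hypothetical.

```python
from datetime import date


def scd1(dim, key, new_value):
    """Type 1: overwrite the attribute, no history is kept."""
    dim[key] = new_value


def scd2(rows, key, new_value, change_date):
    """Type 2: close the current version and append a new one (full history)."""
    for row in rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = change_date
    rows.append({"key": key, "value": new_value,
                 "valid_from": change_date, "valid_to": None})


def scd3(dim, key, new_value):
    """Type 3: keep the current value and the value before the latest change."""
    row = dim[key]
    row["previous"] = row["current"]
    row["current"] = new_value


# The color of vehicle V1 changes from red to blue and then to green.
type1 = {"V1": "red"}
type2 = [{"key": "V1", "value": "red",
          "valid_from": date(2015, 1, 15), "valid_to": None}]
type3 = {"V1": {"current": "red", "previous": None}}

for color, day in [("blue", date(2015, 1, 16)), ("green", date(2015, 1, 17))]:
    scd1(type1, "V1", color)
    scd2(type2, "V1", color, day)
    scd3(type3, "V1", color)

print(type1)        # only the current color survives
print(len(type2))   # one row per version
print(type3)        # current color plus the one before it
```

After both changes, Type 1 knows only green, Type 2 holds three versioned rows, and Type 3 holds green together with blue while red is lost.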
For a MOLAP model with 4 dimensions having 10, 20, 50 and 100 elements and 50 000 facts, what is the size of the cube in terms of number of cells?
• 1 000 000 (= 10 * 20 * 50 * 100)
WHAT IS THE SIZE OF THE CUBE IN TERMS OF NUMBER OF CELLS?
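The arithmetic can be checked with a short calculation. The density figure is an addition for illustration and assumes that each of the 50 000 facts occupies its own cell.

```python
from math import prod

dimension_sizes = [10, 20, 50, 100]   # elements per dimension, from the question
facts = 50_000                        # number of facts, from the question

cells = prod(dimension_sizes)         # the cube has one cell per combination of elements
density = facts / cells               # share of cells that actually hold a fact

print(cells)
print(f"{density:.0%}")
```

Under that assumption only 5% of the cells are filled, which illustrates why sparsity matters for MOLAP storage.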
Which type of OLAP system would you recommend
• for getting fast response times for smaller data sizes
• for achieving an optimal storage utilization
• for an enterprise cube of 10 TB
MOLAP OR ROLAP
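• Fast response times for smaller data sizes: MOLAP
• Optimal storage utilization: ROLAP
• Enterprise cube of 10 TB: ROLAP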
• Precomputation of aggregated values
• Materialized views / query tables store data physically
• Relational Columnar (in-memory) databases
WHICH ROLAP ENHANCEMENTS EXIST?
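Precomputation can be illustrated with a small sketch using Python's built-in `sqlite3` module. SQLite has no materialized views, so the sketch assumes a plain summary table created with `CREATE TABLE ... AS SELECT` as a stand-in; the table names `f_sales` and `agg_sales_month` and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detailed fact table (hypothetical sample data)
    CREATE TABLE f_sales (sale_date TEXT, model TEXT, amount INTEGER);
    INSERT INTO f_sales VALUES
        ('2017-01-10', 'SUV',    100),
        ('2017-01-20', 'SUV',    150),
        ('2017-02-05', 'SUV',    200),
        ('2017-02-07', 'Cabrio', 300);

    -- Aggregates are computed once and stored physically
    CREATE TABLE agg_sales_month AS
        SELECT substr(sale_date, 1, 7) AS month, model, SUM(amount) AS amount
        FROM f_sales
        GROUP BY substr(sale_date, 1, 7), model;
""")

# Reports read the small summary table instead of scanning the fact table.
rows = conn.execute(
    "SELECT month, model, amount FROM agg_sales_month ORDER BY month, model"
).fetchall()
print(rows)
```

A database with real materialized views would additionally take care of refreshing the stored result; in this sketch the summary table would have to be rebuilt after each load.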
WHICH MONITORING TECHNIQUES EXIST TO DETECT CHANGES IN THE SOURCE?
Criterion                                          | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Performance impact on source system                | Medium        | Low                    | Low                 | Medium                    | High
Performance impact on target system                | Low           | Low                    | Low                 | Low                       | High
Load on network                                    | Low           | Low                    | Low                 | Low                       | High
Data loss with NOLOGGING operations                | No            | Yes                    | Yes                 | No                        | No
Identify DELETE operations                         | Yes           | Yes                    | Yes                 | No                        | Yes
Identify ALL changes (changes between extractions) | Yes           | Yes                    | Yes                 | No                        | No
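Why timestamp-based discovery misses DELETE operations while snapshot-based discovery finds them can be shown with a minimal sketch. The source table is modeled as a dictionary keyed by vehicle id; the column `last_modified`, the function names and the sample rows are assumptions made for the illustration.

```python
def changed_since(rows, last_extract):
    """Timestamp-based discovery: rows modified after the previous extraction."""
    return {k for k, v in rows.items() if v["last_modified"] > last_extract}


def diff_snapshots(old, new):
    """Snapshot-based discovery: compare two full copies of the source table."""
    inserts = {k for k in new if k not in old}
    deletes = {k for k in old if k not in new}
    updates = {k for k in new if k in old and new[k] != old[k]}
    return inserts, updates, deletes


# Source table at the previous and at the current extraction (hypothetical data).
yesterday = {
    "V1": {"color": "red",  "last_modified": "2015-01-15"},
    "V2": {"color": "blue", "last_modified": "2015-01-15"},
}
today = {
    "V1": {"color": "green", "last_modified": "2015-01-16"},  # updated
    "V3": {"color": "red",   "last_modified": "2015-01-16"},  # inserted
}                                                              # V2 was deleted

by_timestamp = changed_since(today, "2015-01-15")
inserts, updates, deletes = diff_snapshots(yesterday, today)

print(sorted(by_timestamp))   # the deleted row V2 cannot show up here
print(sorted(inserts), sorted(updates), sorted(deletes))
```

The deleted row no longer exists in the source, so no timestamp can point to it. The snapshot comparison finds it, at the price of reading and transferring the complete table, and it still sees only the net difference between two extractions.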
WHAT ARE TYPICAL DATA QUALITY PROBLEMS AND POSSIBLE SOLUTIONS?
Issue                                  | Solution
Wrong data, e.g. 31.02.2016            | Proper data type definition
Wrong values, e.g. number out of range | CHECK constraint
Missing values                         | NOT NULL constraint
Violated references                    | FOREIGN KEY constraint
Duplicates                             | PRIMARY KEY or UNIQUE constraint
Inconsistent data                      | ACID transactions, business logic, additional checks
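Most of these solutions can be tried out with Python's built-in `sqlite3` module. The sketch below is an illustration under assumptions: the tables `engine` and `vehicle`, the column `doors` and the helper names are hypothetical. Because SQLite does not enforce a date type, the invalid date is checked with Python's `datetime.date` instead.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE engine (engine_id TEXT PRIMARY KEY);
    CREATE TABLE vehicle (
        vehicle_id TEXT PRIMARY KEY,                       -- no duplicates
        model      TEXT NOT NULL,                          -- no missing values
        doors      INTEGER CHECK (doors BETWEEN 2 AND 5),  -- value range
        engine_id  TEXT REFERENCES engine (engine_id)      -- valid references
    );
    INSERT INTO engine VALUES ('E1');
    INSERT INTO vehicle VALUES ('V1', 'SUV', 4, 'E1');
""")


def is_rejected(row):
    """Return True if the database refuses the row because of a constraint."""
    try:
        conn.execute("INSERT INTO vehicle VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


def is_valid_date(year, month, day):
    """A proper date type rejects impossible dates such as 31.02.2016."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


checks = {
    "duplicate key":      is_rejected(("V1", "Cabrio", 2, "E1")),
    "missing value":      is_rejected(("V2", None, 2, "E1")),
    "value out of range": is_rejected(("V3", "SUV", 9, "E1")),
    "violated reference": is_rejected(("V4", "SUV", 4, "E9")),
    "valid row":          is_rejected(("V5", "Cabrio", 2, "E1")),
}
print(checks)
print(is_valid_date(2016, 2, 31))
```

Only the last row passes all constraints; the four faulty rows never reach the table.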
Valid time is the time period during which a fact is true in the real world.
Transaction time is the time period during which a fact stored in the database was known.
Bitemporal data combines both Valid and Transaction Time.
Source: Wikipedia, https://en.wikipedia.org/wiki/Temporal_database
Temporal tables are supported in the SQL standard (since SQL:2011).
WHAT DOES BITEMPORAL DATA MEAN?
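A bitemporal query answers two questions at once: what was true on a given day, and what did the database know about it on another day. The following sketch assumes half-open periods and hypothetical column names (`valid_from`, `valid_to`, `tx_from`, `tx_to`); the engine history of vehicle V1 is invented sample data.

```python
from datetime import date

FOREVER = date.max  # open-ended period

# Each row carries a valid time period and a transaction time period.
engine_of_v1 = [
    # Recorded on 15.01.2015: V1 has engine E1 from 15.01.2015 onwards.
    {"engine": "E1",
     "valid_from": date(2015, 1, 15), "valid_to": FOREVER,
     "tx_from": date(2015, 1, 15), "tx_to": date(2015, 1, 20)},
    # Correction recorded on 20.01.2015: E1 was only valid until 16.01.2015 ...
    {"engine": "E1",
     "valid_from": date(2015, 1, 15), "valid_to": date(2015, 1, 16),
     "tx_from": date(2015, 1, 20), "tx_to": FOREVER},
    # ... and E4 has been plugged in since 16.01.2015.
    {"engine": "E4",
     "valid_from": date(2015, 1, 16), "valid_to": FOREVER,
     "tx_from": date(2015, 1, 20), "tx_to": FOREVER},
]


def as_of(rows, valid_on, known_on):
    """Engines valid on `valid_on` according to what was stored on `known_on`."""
    return [r["engine"] for r in rows
            if r["valid_from"] <= valid_on < r["valid_to"]
            and r["tx_from"] <= known_on < r["tx_to"]]


before_correction = as_of(engine_of_v1, date(2015, 1, 17), date(2015, 1, 18))
after_correction = as_of(engine_of_v1, date(2015, 1, 17), date(2015, 1, 21))
print(before_correction, after_correction)
```

Both queries ask about the same day in the real world (17.01.2015), but they return different engines because the database learned about the engine swap only on 20.01.2015.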
• Indexing
• Partitioning
• Parallelism
• Compression
• Relational In-memory Columnar DB
• Materialized Views
WHICH PERFORMANCE OPTIMIZING TECHNIQUES FOR A DWH EXIST?
Top-Down (Inmon)
• Comprehensive approach regarding available data
• Design Core Warehouse Layer = integrated data model first considering all requirements
• Design data marts afterwards
Bottom-Up (Kimball)
• Approach focusing on fast delivery of first results
• Design one data mart first
• Subsequent marts are modeled afterwards, usually using the Kimball architecture
• Conformed dimensions integrate the different data marts / fact tables
WHICH TWO APPROACHES ARE KNOWN TO BUILD A DWH?
• Shorter and more frequent ETL cycles (Near Real time ETL)
• Higher requirements for system availability and performance
• Mixed OLAP and OLTP type workload
• Data quality mandatory as data is used for automated decisions
WHAT ARE CHALLENGES FOR OPERATIONAL DWHS?
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU
• Example data table
• Hierarchy chart
• Timeline diagrams
• Event matrix
• Enhanced Star schema
WHAT ARE TYPICAL DWH ANALYSIS AND DESIGN WORKPRODUCTS?
Selection
Slicing
Dicing
Rotate/Pivot
Roll-up/Drill-down
WHICH OPERATION REQUIRES A HIERARCHY?
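Roll-up and drill-down move between the levels of a dimension hierarchy, so they are the operations that cannot work without one. A minimal sketch, assuming a time hierarchy day → month → year encoded in ISO date strings; the facts and the names `roll_up` and `LEVEL_PREFIX` are hypothetical.

```python
from collections import defaultdict

# Facts at the most detailed level of the time hierarchy (day, amount).
facts = [
    ("2017-01-10", 100),
    ("2017-01-20", 150),
    ("2017-02-05", 200),
    ("2018-03-01", 50),
]

# Hierarchy day -> month -> year: each level is a prefix of the ISO date string.
LEVEL_PREFIX = {"day": 10, "month": 7, "year": 4}


def roll_up(rows, level):
    """Aggregate the amounts to the given hierarchy level."""
    totals = defaultdict(int)
    for day, amount in rows:
        totals[day[:LEVEL_PREFIX[level]]] += amount
    return dict(totals)


by_month = roll_up(facts, "month")   # roll-up from day to month
by_year = roll_up(facts, "year")     # roll-up from month to year
print(by_month)
print(by_year)
```

Drill-down is the reverse direction: going from `by_year` back to `by_month` requires the detailed data and the same hierarchy definition.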