38
A company of Daimler AG LECTURE @DHBW: DATA WAREHOUSE REVIEW QUESTIONS ANDREAS BUCKENHOFER, DAIMLER TSS

LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

  • Upload
    others

  • View
    15

  • Download
    3

Embed Size (px)

Citation preview

Page 1: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE

REVIEW QUESTIONSANDREAS BUCKENHOFER, DAIMLER TSS

Page 2: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas BuckenhoferSenior DB [email protected]

Since 2009 at Daimler TSS Department: Big Data Business Unit: Analytics

Page 3: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

As a 100% Daimler subsidiary, we give

100 percent, always and never less.

We love IT and pull out all the stops to

aid Daimler's development with our

expertise on its journey into the future.

Our objective: We make Daimler the

most innovative and digital mobility

company.

NOT JUST AVERAGE: OUTSTANDING.

Daimler TSS

Page 4: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision making process,

ability to react quickly)

Daimler TSS 4

Page 5: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Daimler TSS

LOCATIONS

Data Warehouse / DHBW

Daimler TSS China

Hub Beijing

10 employees

Daimler TSS Malaysia

Hub Kuala Lumpur

42 employeesDaimler TSS IndiaHub Bangalore22 employees

Daimler TSS Germany

7 locations

1000 employees*

Ulm (Headquarters)

Stuttgart

Berlin

Karlsruhe

* as of August 2017

5

Page 6: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Distributed data: data is spread across applications

• Different data structures: each system has its won data model

• Missing integrated view: data is not harmonized / standardized

• Missing historic data: OLTP normally stores current data only

• Technological challenges: OLTP has different infrastructure requirements

• System workload: additional workload from DWH users most likely decreases performance for OLTP users

WHICH CHALLENGES COULD NOT BE SOLVED BY OLTP? WHY IS A DWH NECESSARY?

Data Warehouse / DHBWDaimler TSS 6

Page 7: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

EXPLAIN TWO DEFINITIONS OF THE DWH

Data Warehouse / DHBWDaimler TSS 7

Ralph Kimball William Harvey „Bill“ Inmon

„A data warehouse is a copy of transaction data specificallystructured for querying and reporting“

“A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process”

Page 8: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Subject-oriented

• A DWH is organized around the major themes, not around processes

• Integrated• Data in the DWH is harmonized and uniformly standardized across all sources

• Non-volatile

• Operations in a DW are insert and select (Updates and deletes for technical reasons only)

• Time-variant• All data in the data warehouse is accurate as of some moment in time and has to

be associated with a time stamp

WHICH CHARACTERISTICS DOES A DWH HAVE ACCORDING TO BILL INMON?

Data Warehouse / DHBWDaimler TSS 8

Page 9: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Staging (Input)

• Integration (Cleansing)

• Core Warehouse (Storage)

• Aggregation

• Mart (Reporting, Output)

• and additionally Metadata, Security, DWH Manager, Monitor

WHICH LAYERS DOES THE LOGICAL STANDARD ARCHITECTURE HAVE?

Data Warehouse / DHBWDaimler TSS 9

Page 10: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 10

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging Layer(Input Layer)

OLTP

OLTP

Core Warehouse

Layer(Storage

Layer)

Mart Layer(Output Layer)

(Reporting Layer)

Integration Layer

(Cleansing Layer)

Aggregation Layer

Metadata Management

Security

DWH Manager incl. Monitor

Page 11: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• “Landing Zone” for data coming into a DWH

• Purpose is to increase speed into DWH and decouple source and target system (repeating extraction run, additional delivery)

• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)

• Usually not persistent, therefore regular housekeeping is necessary (for instance delete data in this layer that is few days/weeks old or – more common - if a correct upload to Core Warehouse Layer is ensured)

• Tables have no referential integrity constraints, columns often varchar only

DESCRIBE STAGING LAYER CHARACTERISTICS

Data Warehouse / DHBWDaimler TSS 11

Page 12: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model

• Contains enterprise-wide data organized around multiple subject-areas

• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)

• Size of database can be several TB and can grow rapidly due to data historization

DESCRIBE CORE WAREHOUSE LAYER CHARACTERISTICS

Data Warehouse / DHBWDaimler TSS 12

Page 13: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Data is stored in a denormalized data model for performance reasons and better end user usability/understanding

• The Data Mart Layer is providing typically aggregated data or data with less history (e.g. latest years only) in a denormalized data model

• Created through filtering or aggregating the Core Warehouse Layer

• One Mart ideally represents one subject area

• Technically the Data Mart Layer can also be a part of an Analytical Frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not to be stored in a relational database

DESCRIBE MART LAYER CHARACTERISTICS

Data Warehouse / DHBWDaimler TSS 13

Page 14: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• 2-layered architecture

• Sum of the data marts constitute the Enterprise DWH

• Enterprise Service Bus / conformed dimensions for integration purposes

(don’t confuse with ESB as middleware/communication system between applications)

• Rather simple approach to make data fast and easily accessible

• Lower startup costs (but higher subsequent development costs)

• If table structures change (instable source systems), high effort to implement the changes and reload data, especially conformed dimensions (“Dimensionitis” desease)

KIMBALL BUS ARCHITECTUREWHAT ARE MAIN CHARACTERISTICS?

Data Warehouse / DHBWDaimler TSS 14

Page 15: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Co

re W

are

ho

use

La

yer

DATA INTEGRATION WITH AND WITHOUT COREWAREHOUSE LAYER

Data Warehouse / DHBWDaimler TSS 15

Page 16: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Data in Raw Data Vault Layer is regarded as “Single version of the facts”

• Business rules are implemented down-stream

• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only”

• Real-Time capability

• Write-back of data in Core Warehouse Layer

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)WHAT ARE MAIN CHARACTERISTICS?

Data Warehouse / DHBWDaimler TSS 16

Page 17: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

WHICH TABLE TYPES ARE USED IN DATA VAULT?

Data Warehouse / DHBWDaimler TSS 17

Page 18: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?

Data Warehouse / DHBWDaimler TSS 18

vehicleid model productiondate

engine color

V1 SUV 15.01.13 E1 red

V2 Cabrio 16.01.13 E2 blue

V1 SUV 15.01.13 E1 red

V3 Cabrio 17.01.13 E3 red

Staging Data in table stg_vehicle from 15.01.2015

V1 SUV 16.01.13 E4 red

V4 Cabrio 17.01.13 E5 blue

Staging Data Data in table stg_vehicle from 16.01.2015

V1 SUV 16.01.13 E1 red

Staging Data Data in table stg_vehicle from 17.01.2015

Page 19: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• H_VEHICLE

• 4 rows: V1, V2, V3, V4

• H_ENGINE

• 5 rows: E1, E2, E3, E4, E5

• L_PLUGGED_IN_EFFECTIVITY

• 5 rows: V1-E1, V2-E2, V3-E3, V1-E4, V4-E5

HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?

Data Warehouse / DHBWDaimler TSS 19

Page 20: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

HOW MANY ROWS ARE STORED IN THE FIRST 3 SAT TABLES?

Data Warehouse / DHBWDaimler TSS 20

vehicleid model productiondate

engine color

V1 SUV 15.01.13 E1 red

V2 Cabrio 16.01.13 E2 blue

V1 SUV 15.01.13 E1 red

V3 Cabrio 17.01.13 E3 red

Staging Data Data in table stg_vehicle from 15.01.2015

V1 SUV 16.01.13 E4 red

V4 Cabrio 17.01.13 E5 blue

Staging Data Data in table stg_vehicle from 16.01.2015

V1 SUV 16.01.13 E1 red

Staging Data Data in table stg_vehicle from 17.01.2015

Page 21: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

HOW MANY ROWS ARE STORED IN THE FIRST 3 SAT TABLES?

Data Warehouse / DHBWDaimler TSS 21

vehicleid model productiondate

engine color

V1 SUV 15.01.13 E1 red

V2 Cabrio 16.01.13 E2 blue

V1 SUV 15.01.13 E1 red

V3 Cabrio 17.01.13 E3 red

Staging Data Data in table stg_vehicle from 15.01.2015

V1 SUV 16.01.13 E4 red

V4 Cabrio 17.01.13 E5 blue

Staging Data Data in table stg_vehicle from 16.01.2015

V1 SUV 16.01.13 E1 red

Staging Data Data in table stg_vehicle from 17.01.2015

5

4

6

4

5

5

Page 22: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

WHICH TABLE TYPES ARE USED IN A STAR SCHEMA?

Data Warehouse / DHBWDaimler TSS 22

Page 23: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

SCD Type 1

• No History

• Dimension attributes always contain current data

SCD Type 2

• Full Historization

• Dimension contains timestamps

SCD Type 3

• Historization of latest change only

• And storage of current value

WHICH THREE TYPES OF SLOWLY CHANGING DIMENSIONSARE MOST COMMON?

Data Warehouse / DHBWDaimler TSS 23

Page 24: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

For an MOLAP model with 4 dimensions having 10, 20, 50 and 100 elements and 50 000 facts what is the size of the cube in terms of number of cells?

• 1 000 000 (= 10 * 20 * 50 * 100)

WHAT IS THE SIZE OF THE CUBE IN TERMS OF NUMBER OF CELLS?

Data Warehouse / DHBWDaimler TSS 24

Page 25: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Which type of OLAP system would you recommend

• for getting fast response times for smaller data sizes

• for achieving an optimal storage utilization

• for an enterprise cube of 10 TB

MOLAP OR ROLAP

Data Warehouse / DHBWDaimler TSS 25

MOLAP

ROLAP

ROLAP

Page 26: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Precomputation of aggregated values

• Materialized views / query tables store data physically

• Relational Columnar (in-memory) databases

WHICH ROLAP ENHANCEMENTS EXIST?

Data Warehouse / DHBWDaimler TSS 26

Page 27: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

WHICH MONITORING TECHNIQUES EXIST TO DETECT CHANGES IN THE SOURCE?

Data Warehouse / DHBWDaimler TSS 27

Trigger-based Replication techniques

Log-based discovery

Timestamp-based discovery

Snapshot-based discovery

Performanceimpact on source system

Medium Low Low Medium High

Performanceimpact on target system

Low Low Low Low High

Load on network Low Low Low Low High

Data loss if nologgingoperations

No Yes Yes No No

Page 28: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

WHICH MONITORING TECHNIQUES EXIST TO DETECT CHANGES IN THE SOURCE?

Data Warehouse / DHBWDaimler TSS 28

Trigger-based Replication techniques

Log-based discovery

Timestamp-based discovery

Snapshot-based discovery

Identify DELETE operations

Yes Yes Yes No Yes

Identify ALLchanges (changes between extractions)

Yes Yes Yes No No

Page 29: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

WHAT ARE TYPICAL DATA QUALITY PROBLEMS AND POSSIBLE SOLUTIONS?

Data Warehouse / DHBWDaimler TSS 29

Issue Solution

Wrong data e.g. 31.02.2016 Proper data type definition

Wrong values, e.g. number out of range Check constraint

Missing values NOT NULL constraint

Violated references FOREIGN KEY constraint

Duplicates PRIMARY or UNIQUE KEY constraint

Inconsistent data ACID transactions, business logic, additional checks

Page 30: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

WHAT ARE TYPICAL DATA QUALITY PROBLEMS AND POSSIBLE SOLUTIONS?

Data Warehouse / DHBWDaimler TSS 30

Issue Solution

Wrong data e.g. 31.02.2016 Proper data type definition

Wrong values, e.g. number out of range Check constraint

Missing values NOT NULL constraint

Violated references FOREIGN KEY constraint

Duplicates PRIMARY or UNIQUE KEY constraint

Inconsistent data ACID transactions, business logic, additional checks

Page 31: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Valid time is the time period during which a fact is true in the real world.

Transaction time is the time period during which a fact stored in the database was known.

Bitemporal data combines both Valid and Transaction Time.

Source: (Wikipedia, https://en.wikipedia.org/wiki/Temporal_database)

Supported in SQL standard

WHAT DOES BITEMPORAL DATA MEAN?

Data Warehouse / DHBWDaimler TSS 31

Page 32: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Indexing

• Partitioning

• Parallelism

• Compression

• Relational In-memory Columnar DB

• Materialized Views

WHICH PERFORMANCE OPTIMIZING TECHNIQUES FOR A DWH EXIST?

Data Warehouse / DHBWDaimler TSS 32

Page 33: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Top-Down (Inmon)

• Comprehensive approach regarding available data

• Design Core Warehouse Layer = integrated data model first considering all requirements

• Design data marts afterwards

Bottom-Up (Kimball)

• Approach focusing on fast delivery of first results

• Design one data mart first

• Next Marts are modeled afterwards usually using Kimball architecture

• conformed dimensions to integrate different data marts / fact tables

WHICH TWO APPROACHES ARE KNOWN TO BUILD A DWH?

Data Warehouse / DHBWDaimler TSS 33

Page 34: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Shorter and more frequent ETL cycles (Near Real time ETL)

• Higher requirements for system availability and performance

• Mixed OLAP and OLTP type workload

• Data quality mandatory as data is used for automated decisions

WHAT ARE CHALLENGES FOR OPERATIONAL DWHS?

Data Warehouse / DHBWDaimler TSS 34

Page 35: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Data Warehouse / DHBWDaimler TSS 35

Page 36: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSDomicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

Data Warehouse / DHBWDaimler TSS 36

THANK YOU

Page 37: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

• Example data table

• Hierarchy chart

• Timeline diagrams

• Event matrix

• Enhanced Star schema

WHAT ARE TYPICAL DWH ANALYSIS AND DESIGN WORKPRODUCTS?

Data Warehouse / DHBWDaimler TSS 37

Page 38: LECTURE @DHBW: DATA WAREHOUSE REVIEW ...buckenhofer/20181DWH/Bucken...Department: Big Data Business Unit: Analytics As a 100% Daimler subsidiary, we give 100 percent, always and never

Selection

Slicing

Dicing

Rotate/Pivot

Roll-up/Drill-down

WHICH OPERATION REQUIRES A HIERARCHY?

Data Warehouse / DHBWDaimler TSS 38