176
Ein Unternehmen der Daimler AG Lecture @DHBW: Data Warehouse 03 Architecture Andreas Buckenhofer

Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

  • Upload
    others

  • View
    21

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Ein Unternehmen der Daimler AG

Lecture @DHBW: Data Warehouse

03 Architecture

Andreas Buckenhofer

Page 2: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com

Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision

Andreas BuckenhoferSenior DB Professional

Since 2009 at Daimler TSS

Department: Machine Learning Solutions

Business Unit: AnalyticsDHBWDOAG

xing

Contact/Connect

vcard

• Oracle ACE Associate

• DOAG responsible for InMemory DB

• Lecturer at DHBW

• Certified Data Vault Practitioner 2.0

• Certified Oracle Professional

• Certified IBM Big Data Architect

• Over 20 years experience with

database technologies

• Over 20 years experience with Data

Warehousing

• International project experience

Page 3: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 3

Change Log

Date Changes

01.03.2020 Initial version

Page 4: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 4

What you will learn today

• Explain and draw different DWH architectures

• Explain and draw Big Data architectures

• Understand Data Lakes and other buzzwords

Page 5: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Purpose: why are DWH/Big Data architectures useful?

• Specific implementation can follow an architecture

• Architecture describes an ideal type. Therefore an implementation may not use all components or can combine components

• Better understanding, overview and complexity reduction by decomposing a DWH into its components

• Can be used in many projects: repeatable, standardizable

• Map DWH tools into the different components and compare functionality

• Functional oriented as it describes data and control flow

5Daimler TSS Data Warehouse / DHBW 5

Page 6: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 7: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 8: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Standard Data Warehouse architecture

8

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 8

Page 9: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data sources

• Providing internal and external data out of the source systems

• Enabling data through Push (source is generating extracts) or Pull (BI Data Backend is requesting or directly accessing data)

• Example for Push practice (deliver csv or text data through file interface; Change Data Capture (CDC))

• Example for Pull practice (direct access to the source system via ODBC, JDBC, API and so on)

9Daimler TSS Data Warehouse / DHBW 9

Page 10: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Staging layer

• “Landing Zone” for data coming into a DWH

• Purpose is to increase speed into DWH and decouple source and target system (repeating extraction run, additional delivery)

• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)

• Usually not persistent, therefore regular housekeeping is necessary (for instance delete data in this layer that is few days/weeks old or – more common - if a correct upload to Core Warehouse Layer is ensured)

• Tables have no referential integrity constraints, columns often varchar

10Daimler TSS Data Warehouse / DHBW 10

Page 11: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Integration layer

• Business Rules, harmonization and standardization of data

• Classical Layer for transformations: ETL = Extract – TRANSFORM – Load

• Fixing data quality issues

• Usually not persistent, therefore regular housekeeping is necessary (for instance after a few days or weeks or at the latest once a correct upload to Core Warehouse Layer is ensured)

• The component is often not required or often not a physical part of a DB

Daimler TSS Data Warehouse / DHBW 11

Page 12: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Core Warehouse layer

• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model

• Contains enterprise-wide data organized around multiple subject-areas

• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)

• Size of database can be several TB and can grow rapidly due to data historization

• Write-optimised layer

Daimler TSS Data Warehouse / DHBW 12

Page 13: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Aggregation layer

• Preparing data for the Data Mart Layer to the required granularity

• E.g. Aggregating daily data to monthly summaries

• E.g. Filtering data (just last 2 years or just data for a specific region)

• Harmonize computation of key performance indicators (measures) and additional Business Rules

• The component is often not required or often not a physical part of a DB

Daimler TSS Data Warehouse / DHBW 13

Page 14: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Mart layer

• Read-optimised layer: Data is stored in a denormalized data model for performance reasons and better end user usability/understanding

• The Data Mart Layer is providing typically aggregated data or data with less history (e.g. latest years only) in a denormalized data model

• Created through filtering or aggregating the Core Warehouse Layer

• One Mart ideally represents one subject area

• Technically the Data Mart Layer can also be a part of an Analytical Frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not to be stored in a relational database

Daimler TSS Data Warehouse / DHBW 14

Page 15: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Metadata management, Security, DWH Manager

• Metadata Management

• Not just “Data about Data”, separate lecture

• Security

• Not all users are allowed to see all data

• Data security classification (e.g. restricted, confidential, secret)

• DWH Manager incl. Monitor

• DWH Manager initiates, controls, and checks job execution

• Monitor identifies changes/new data from source systems, separate lecture

Daimler TSS Data Warehouse / DHBW 15

Page 16: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 17: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kimball Bus Architecture (Central Data Warehouse based on data marts)

Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/Daimler TSS Data Warehouse / DHBW 17

Page 18: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kimball bus architecture (Central Data Warehouse based on data marts)

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging Layer

(Input Layer)

OLTP

OLTP

Core Warehouse Layer

= Mart Layer

Data Mart 1

Data Mart 2Data Mart 3

Metadata Management

Security

DWH Manager incl. Monitor

More Business-

process oriented

than subject-

oriented,

integrated,

time-variant,

non-volatile

Daimler TSS Data Warehouse / DHBW 18

Page 19: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kimball Bus Architecture (Central Data Warehouse based on data marts)

• Bottom-up approach

• Dimensional model with denormalized data

• Sum of the data marts constitute the Enterprise DWH

• Enterprise Service Bus / conformed dimensions for integration purposes

• (don’t confuse with ESB as middleware/communication system between applications)

• Kimball describes that agreeing on conformed dimensions is a hard job and it’s expected that the team will get stuck from time to time trying to align the incompatible original vocabularies of different groups

• Data marts need to be redesigned if incompatibilities exist

Daimler TSS Data Warehouse / DHBW 19

Page 20: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Co

re W

are

house

Laye

r

Data Integration with and without Core Warehouse layer

Daimler TSS Data Warehouse / DHBW 20

Page 21: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 22: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture – today’s world (Dan Linstedt)

Data Warehouse /

DHBW

Page 23: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture (Dan Linstedt)

Daimler TSS

Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2

Daimler TSS Data Warehouse / DHBW 23

Page 24: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture (Dan Linstedt)

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging Layer

(Input Layer)

OLTP

OLTP

Raw

Data Vault

Mart Layer

(Output

Layer)

(Reporting

Layer)

Business

Data Vault

Metadata Management

Security

DWH Manager incl. Monitor

Hard

Rules

onlySoft Rules

Daimler TSS Data Warehouse / DHBW 24

Page 25: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture (Dan Linstedt)

• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only” (Data Vault modeling is a separate lecture)

• Business rules (Soft Rules) are applied from Raw Data Vault Layer to Mart Layer and not earlier

• Alternatively from Raw Data Vault to additional layer called Business Data Vault

• Hard Rules don’t change data

• Data is fully auditable

• Real-time capable architecture

• Architecture got very popular recently; also applicable to BigData, NoSQL

Daimler TSS Data Warehouse / DHBW 25

Page 26: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture (Dan Linstedt)

• In the classical DWHs, the Core Warehouse Layer is regarded as “single version of the truth”

• Integrates + cleanses data from different sources and eliminates contradiction

• Produces consistent results/reports across Data Marts

• But: cleansing is (still) objective, Enterprises change regularly, paradigm does not scale as more and more systems exist

• Data in Raw Data Vault Layer is regarded as “Single version of the facts”

• 100% of data is loaded 100% of time

• Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)

Daimler TSS Data Warehouse / DHBW 26

Page 27: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture (Dan Linstedt)

• Data Vault is optimized for the following requirements:

• Flexibility

• Agility

• Data historization

• Data integration

• Auditability

• Bill Inmon wrote in 2008: “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)

Daimler TSS Data Warehouse / DHBW 27

Page 28: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 28

Data Vault – architecture, methodology, model

Lecture

DWH Architectures

Lecture

DWH Data Modeling Implementation: Automation,

Pattern based, High speed

Architecture

• Multi-Tier

• Scalable

• Supports NoSQL

Methodology

• Repeatable

• Measureable

• Agile

Model

• Flexible

• Hash based

• Hub & Spoke

Page 29: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DWH Examples

Page 30: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 30

Examples of DWHs in the industry

Apple: multiple Petabytes

• Customer insights: who’s who and what are the customers up to

Walmart: 300TB (2003), several PB today

• It tells suppliers, “You have three feet of shelf space. Optimize it.”

eBay: >10PB, 100s of production DBs fed in

• Get better understanding of customers

Most DWHs are much smaller though. For huge and small DWHs: High challenges to architect + develop + maintain + run such complex systems

Sources: https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/ and

http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

Page 31: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 31

Smart Factory: Operational BI Service Platform

Source: Gluchowski: Analytische Informationssysteme, 5.Aufl., p. 279

Workers getting

alarms on their

watch

Containing and

displaying

complex manuals,

e.g. during repair

New data

sources

sending lots

of data with

high speed

(sensor data,

logs, etc.)

Right-Time

data required

for

automated

actions, e.g.

cordless

screwdriver

knows and

adjusts

torque

Page 32: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Digital Twin

• Digital replica of an

entity (product,

service, factory, etc)

• Represent entities

during the whole

lifecycle (including

design phase)

Page 33: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: recommend an architecture

Review the presented Data Warehouse architectures.

Which architecture would you recommend for

• An online store with real/right-time data integration needs

• Marketing department of a bank

List advantages and drawbacks of your proposal.

Daimler TSS Data Warehouse / DHBW 33

Page 34: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: recommend an architecture

An online store with real-time/right-time data integration needs

• Architecture: Data Vault 2.0

• + Integration of many internal and external source systems (e.g. integrate social media data about the online store)

• + Fast data delivery in Raw Vault Layer (Real-time/Right-time data integration). Complex data cleansing / transformation / soft rules are delayed downstream towards Mart Layer

• - Transformation overhead (Source system data model to Data Vault data model to Dimensional data model)

Daimler TSS Data Warehouse / DHBW 34

Page 35: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: recommend an architecture

Marketing department of a bank

• Architecture: Kimball Bus architecture

• + Start small for a department. If other departments are interested, new data and new Marts can be added on demand

• - High risk to loose the Enterprise view and several DWHs are built

That’s still quite a common scenario nowadays. A single Enterprise DWH is often not achieved (e.g. Mergers & Acquisitions, inflexibility due to a single centralized DWH, rapidly changing conditions, etc.) and therefore very often several DWHs with different architectures exist in parallel within a company.

Daimler TSS Data Warehouse / DHBW 35

Page 36: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Summary DWH architectures

Which layers does the standard DWH architecturehave?

• Staging (Input), Integration (Cleansing), Core Warehouse (Storage), Aggregation, Mart (Reporting, Output) and additionally Metadata, Security, DWH Manager + Monitor

Which other architectures exist?

• Kimball Bus Architecture (Central Data Warehouse based on data marts)

• Data Vault 2.0 Architecture (Dan Linstedt)

There are even more

• DW 2.0: The Architecture for the Next Generation of Data Warehousing

• Operational Data Store (ODS)

Daimler TSS Data Warehouse / DHBW 36

Page 37: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DWH architectures - advantages

Daimler TSS Data Warehouse / DHBW 37

Standard DWH

Single version of the truth

Kimball Bus

Single version of the truth

Data Vault 2.0

Single version of the

facts

• Separation of concerns: long-

term enterprise data storage

separated from data

presentation

• Changes in requirements and

scope are easier to manage

• Lower subsequent

development costs (but

higher startup costs)

• Two layers only mean faster

development and less work

• Rather simple approach to

make data fast and easily

accessible

• Lower startup costs (but

higher subsequent

development costs)

• Separation of concerns:

long-term enterprise data

storage separated from

data presentation

• Store raw data and

postpone time-consuming

data cleansing

• Supports flexibility and

agility

Page 38: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DWH architectures - disadvantages

Daimler TSS Data Warehouse / DHBW 38

Standard DWH Kimball Bus Data Vault 2.0

• Data model transformations

from 3NF to Dimensional

model required

• More complex as two

different data models are

required

• Larger team(s) of specialists

required

• High effort to start

• Can become inflexible and

expensive in the long-run

• If table structures change (instable

source systems), high effort to

implement the changes and reload

data, especially conformed

dimensions (“Dimensionitis”

desease)

• Non-metric data not optimal for

dimensional model

• Dimensional model (esp. Star

Schema) contains data redundancy

• Difficult to optimize: read or write?

• Data model transformations

from Data Vault model to

Dimensional model required

• More complex as two different

data models are required

Page 39: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 40: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 41: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Standard Data Warehouse architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 41

Page 42: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What are differences between Hadoop and DWH?

Dump question, but actually there are many comparisons like that at the moment

• Hadoop is a tool / technology (or even many tools) like a RDBMS

• DWH is an architecture and concept

• Architecture is abstraction and defines a goal

• Architecture vs tools / technology

Daimler TSS Data Warehouse / DHBW 42

Source: https://de.slideshare.net/mrm0/architecting-a-data-platform-for-enterprise-use-strata-ny-2018

Page 43: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What is a Data Lake?

Hadoop

Dump anything in and wait?

Hoard 100ths of

Petabyte in HDFS?

Daimler TSS Data Warehouse / DHBW 43

Page 44: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Assumptions + requirements have changed

• The data is not always known in advance, so it can’t be modeled in advance.

• A variety of data formats are coming like JSON, pdf, text, audio, video, voice, etc.

• The data architecture must be read-write from both the back and front, not a one-way data flow. The data written back may be repeatedly used, persistent data, or it may be temporary.

• The data may arrive with any frequency, and the rate may not be under your control.

Daimler TSS Data Warehouse / DHBW 44

Page 45: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

It‘s very hard to get speed and quality – choose one

Stability & Quality

•Predictable

• Improving

•Save user time

Flexibility & Speed

•Explanatory

•Experimenting

•Save developer time

Source: https://www.gartner.com/it-glossary/bimodal/

Daimler TSS Data Warehouse / DHBW 45

Page 46: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake on Hadoop

Data Swamp

Data Reservoir

Landing Zone

Data Library

Data Repository

Data Archive

Data Lake on Spark

Data Lake 3.0

Page 47: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 47

Landing Zone – Definition by Dan Linstedt

Source: https://datavaultalliance.com/news/defining-a-data-lake/

Page 48: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 48

Data Lake – Definition by Dan Linstedt

Source: https://datavaultalliance.com/news/defining-a-data-lake/

Page 49: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 49

Data LakeData Vault 2.0 systems architecture by Dan Linstedt

Source: https://datavaultalliance.com/news/defining-a-data-lake/

Page 50: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data LakeZones instead of layers according to Mark Madsen

New data of unknown value,

simple requests for new data

can land here first, with little

work by IT. Typically

schema-on-read.

More effort applied to data

management: cleansed,

curated. Typically schema-

on-write.

Optimized for specific uses /

workloads. Personal folders.

Schema-on-write.

Daimler TSS Data Warehouse / DHBW 50

Source: https://de.slideshare.net/mrm0/architecting-a-data-platform-for-enterprise-use-strata-ny-2018

Page 51: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake architecture - summary

Data Lake

FrontendBackend

External data sources

Internal data sources

OLTP

OLTP

Standardized data (enhanced data)

Sandbox (temporary data)

Metadata Management

Security

Data Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 51

Persistent Landing Zone

(raw Data)

DWH

Page 52: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Characteristics of a Data Lake architecture

Characteristics of a Data Lake architecture:

• Is managed, governed, organized

• Deals with data and schema change easily

• Does not require up front modeling

• Does not limit the format or structure of data

• Assumes a full range of data latencies, from streaming to one-time bulk loads

• Scales easily for higher data volumes

• Leverages lineage and metadata

Daimler TSS Data Warehouse / DHBW 52

Page 53: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 53

Old wine in new skins

Source: https://twitter.com/nheudecker/status/1110252515931508736?s=11

Page 54: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 55: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Lambda architecture

• Architecture by Nathan Marz

• Combine realtime and batch processing

• Batch layer stores and historizes raw data

• Serving layer contains batch views

• Query unions serving and realtime layer

• Rather complex

• Author recommends graph data model and advises against schema-on-read

Daimler TSS Data Warehouse / DHBW 55

Page 56: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What is the main inevitable problem in data systems?

• Lack of human fault tolerance

• Bugs will be deployed

• Operational errors will happen, e.g. accidentally delete data

• Data loss / corruption is worst case scenario

• Without data loss / corruption, mistakes can be fixed if original data is still available

• Must design for human error like you’d design for any other fault

Daimler TSS Data Warehouse / DHBW 56

Page 57: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Analytics vs Re-Analytics

“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.

Source: http://www.odbms.org/blog/2014/07/big-data-science-interview-jules-j-berman/

Daimler TSS Data Warehouse / DHBW 57

Page 58: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Immutable data - Master data set

• Atomic, immutable data

• Ensure that data can not be deleted (accidentally)

• Fundamentally simpler

• Easy to implement on top of a distributed filesystem, eg Hadoop

• CR instead of CRUD

• No updates

• No deletes

• Create (insert) and read (select) only

Immutability restricts the range of errors that can cause data loss/corruption

Daimler TSS Data Warehouse / DHBW 58

Page 59: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What is the Lambda architecture?

Batch Layer

All Data (Master data set)

Speed Layer

RealTime Views

Serving Layer

Batch Views

Query(merge)

Data StreamDaimler TSS Data Warehouse / DHBW 59

Page 60: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Lambda architecture – data examplecompute follower list

Batch

Layer

1.1. insert Jim

1.1. insert Anne

2.1 remove Jim

3.1. insert George

5.1. insert John (now)

Speed Layer

insert John (now)

Serving

LayerAnne

George

Query

Data Stream

Anne

George

John

Follower: 3

Daimler TSS Data Warehouse / DHBW 60

Page 61: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Batch views and realtime views

Daimler TSS Data Warehouse / DHBW 61

Page 62: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data absorbtion

https://dzone.com/articles/Lambda-architecture-with-apache-spark

Daimler TSS Data Warehouse / DHBW 62

Page 63: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Eventual accuracy

Computing best answer in real time may not always be possible

• Can compute exact answer in batch layer and approximate answer in realtime layer

• Best of both worlds of performance and accuracyFor example, a machine learning application where generation of the batch model requires so much time and resources that the best result achievable in real-time is computing and approximated updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.

Daimler TSS Data Warehouse / DHBW 63

Page 64: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Which tools could be used in the Lambda architecture?

Batch Layer

All Data

Speed Layer

RealTime Views

Serving Layer

Batch Views

Query(merge)

Data StreamDaimler TSS Data Warehouse / DHBW 64

Page 65: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Cloud vendors offerings, e.g. Microsoft Azure

https://social.technet.microsoft.com/wiki/contents/articles/33626.Lambda-architecture-implementation-using-microsoft-azure.aspx

Daimler TSS Data Warehouse / DHBW 65

Page 66: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Cloud vendors offerings, e.g. Amazon AWS

https://aws.amazon.com/de/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-Lambda-architecture-without-servers/

https://d0.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-aws.pdfDaimler TSS Data Warehouse / DHBW 66

Page 67: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Lambda architecture – pros and cons

Pro Con

Architecture emphasizes to keep data

immutable. Mistakes can be corrected via

recomputation

Maintaining code that needs to produce

the same result in two complex distributed

systems

Reprocessing is one of the key challenges

of stream processing but is very often

ignored

Operational burden of running and

debugging two systems

Daimler TSS Data Warehouse / DHBW 67

Page 68: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Summary Lambda architecture

• Incoming data is sent to batch and speed layer

• Batch layer constantly (re-) computes batch views

• Master data is stored in the batch layer in raw format: immutable & append-only

• Contains data except most up-to-date data due to high latency

• Replaces data in speed layer as soon as data in newer compared to speed layer

• Speed layer uses incremental algorithms to refresh real-time Views

• Receives data stream for real-time processing

• Contains most up-to-date data only

• Serving layer contains views on batch layer data

• Merge is done in the query or in the serving layerDaimler TSS Data Warehouse / DHBW 68

Page 69: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 70: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Questioning the Lambda architecture

Jay Kreps wrote an article about „Questioning the Lambda architecture“

• He wrote about his experience with the Lambda architecture

• It works, but not very pleasant or productive

• Keeping code in sync is really hard

• Need to build complex, low-latency processing systems

• Scalable high-latency batch system

• Low-latency stream stream-processing system

• Instead of duct taping batch & speed:

Kappa architecture / Log-centric architecture / Stream data platform

Source: https://www.oreilly.com/ideas/questioning-the-Lambda-architecture

Daimler TSS Data Warehouse / DHBW 70

Page 71: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kappa architecture

• Architecture by Jay Kreps

• Logcentric, write-ahead logging

• Each event is an immutable log entry and is added to the end of the log

• Read and write operations are separated

• Materialized views can be recomputed consistently from data in the log

Daimler TSS Data Warehouse / DHBW 71

Page 72: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

72

What is a message? A log!

Source: Jay Kreps: I heart logs, O’Reilly 2014

Daimler TSS Data Warehouse / DHBW

Page 73: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

73

What is a log?

• A bank account’s current balance can be built from a complete list of its debits and credits, but the inverse is not true.

• In this way, the log of transactions is the more “fundamental” data structure than the database records storing the results of those transactions.

A software application’s database is better thought of as a series of time-ordered immutable facts collected since that system was born, instead of as a current snapshot of all data records as of right now.Source: https://blog.parse.ly/post/1550/kreps-logs/

Daimler TSS Data Warehouse / DHBW

Page 74: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kappa architecture

Source: https://www.oreilly.com/ideas/questioning-the-Lambda-architecture

Daimler TSS Data Warehouse / DHBW 74

Page 75: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Reprocessing

Lambda architecture

• Complex

• It works, but not very pleasant or productive: keeping code in sync is really hard

• Addresses the challenge of reprocessing

• Batch views to contain updated data (a Lambda architecture rcommendation)

• Bugs (true requirement for reprocessing) new processing logic

• New user requirements (true requirement for reprocessing) new processing logic

Kappa architecture

• Logs unify Batch processing + Stream processing

• In Kappa, reprocessing required only when processing logic has been modified

Daimler TSS Data Warehouse / DHBW 75

Page 76: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kappa architecture – pros and cons

Pro Con

Flexible architecture that can be used

• stand-alone

• combined with a Data Lake or DWH

• in a microservice architecture

All use cases must be mapped to a

streaming problem which can become

artificial for well-known cases, e.g. sales

prognosis.

Easy to understand Not suited for analysis of history data:

sequential reading and writing is fast,

random is not.

Daimler TSS Data Warehouse / DHBW 76

Page 77: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Summary: Enterprise-wide architecture

Source: Jay Kreps: I heart logs, O’Reilly 2014

Daimler TSS Data Warehouse / DHBW 77

Page 78: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

1. DWH architectures

• Standard DWH architecture

• Kimball Bus architecture

• Data Vault 2.0 architecture

2. Big Data architectures

• Data Lake

• Lambda architecture

• Kappa architecture

Page 79: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 79

Architecture - summary

One size does not fit all

• Understand why an architecture is needed

• Use + work on an architecture

• Don’t reinvent the wheel

• But adapt to your needs and constraints (time, budget, skills, etc)

Page 80: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS Data Warehouse / DHBW 80

Theory vs Reality: Gather all data in one place

Source : https://www.linkedin.com/feed/update/urn:li:activity:6440630785102880768/

Page 81: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Virtual architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Data

Lake

OLTPQuery Management

Weakly+partly subject-oriented,

Weakly+partly integrated,

Not time-variant,

Not non-volatile

Views,

Alias names,

Metadata

DWH

Daimler TSS Data Warehouse / DHBW 81

Page 82: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Virtual architecture

• Data not extracted from operational systems and stored separately

• Standardized interface for all operational data sources

• One "GUI" for all existing data

• Generates combined queries

• Query Processor joins query result data from different sources

• Can also access data in Hadoop (Polybase, Big SQL, BigData SQL, etc)

Daimler TSS Data Warehouse / DHBW 82

Page 83: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Virtual architecture

• Query Management manages metadata about all operational systems

• (physical) location of data and algorithms for extracting data from OLTP system

• Implementation easier

• Low cost: can use existing hardware infrastructure

• Queries cause significant performance problems in operational systems

• Known problems when analyzing operational data directly

• Same query is processed multiple times (if queried multiple times)

• Same query delivers different results when processed at different times

Daimler TSS Data Warehouse / DHBW 83

Page 84: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise - Architecture

• Now imagine that you prepare an exam.

• Identify 1 question about DWH / Big Data architecture that you would ask in an exam.

• Write down the question on a stick-it card.

Daimler TSS Data Warehouse / DHBW 84

Page 85: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com

Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision

Page 86: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Architecture from a project in the automotive industry

ETL Engine

Fro

nte

nd

Standard

Reports

AdHoc

ReportsLogs

TSM

IIDR

ReplEngine

Source

Datastore

Source

Mirror DB

(Operational

Data Store)

OLTP

DB

IIDR ReplEngine

Mirror

Datastore

Mirror

IIDR ReplEngine

DWH

Datastore

DWH

Backend

DWH DB

Staging Layer

Raw + Business Data Vault

Mart Layer

Daimler TSS Data Warehouse / DHBW 86

Page 87: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

End user sample questions

Which vehicles or aggregates are documented incompletely? (Data quality)

Which vehicles / which control units require SW updates?

Which interiors are most common by region?

Supply data for external simulations, customs clearance, spare part planning, etc.

Daimler TSS Data Warehouse / DHBW 87

Page 88: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data management according to BARCFlexibility causes heterogenity

Source: https://www.datafestival.de/events/proof-of-concepts-als-strategietool-erfahrungen-aus-der-medienbranche-de/; Jacqueline Bloemen, Datenarchitektur für Business Analytics – was Sie

berücksichtigen sollten, DataFestival 2019

Daimler TSS Data Warehouse / DHBW 88

Page 89: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Use case: Analysis battery aging

• JSON data ingested into HDFS, Hive tables on JSON files

• Identify breaks (“> 8h”) and compute current drain

Max capacity

Current capacity

Daimler TSS Data Warehouse / DHBW 89

Page 90: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Structuring the Data Lakenew data sources – sensor data

• Sensor data format change without notice

• Sensors get regularly updated with new versions

• Names of metrics may change

• Sensors with various versions in the field

• Sensors from different suppliers

• Often many fields >>100 and increasing with new sensor versions

• Easy storing of data in HDFS and applying schema later

• Data from Robots, vehicles, …

Daimler TSS Data Warehouse / DHBW 90

Page 91: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Use case: Analysis battery agingData Lake architecture

Structuring the Data Lake “schema-on-read”

• Sensor data format change without notice

• Time consuming and error-prone data integration into the Data Lake

• Therefore preparation of data for usage in the Consumption area required: “Data Engineer”

Raw dataD

ata

Gove

rnan

ce

Consumption

Enhanced data

Meta

data

Managem

ent

Data

Arc

hiva

l

Data

Security

json

Samp-

ling /

filter

Hive tables

Hive tables

Struc-

ture

R Python

Daimler TSS Data Warehouse / DHBW 91

Page 92: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Other DWH architectures

• Inmon Corporate Information Factory

• DW 2.0: The Architecture for the Next Generation of Data Warehousing

• Operational Data Store (ODS)

• Virtual architecture

Daimler TSS Data Warehouse / DHBW 92

Page 93: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: classical DWH architectures

The article

http://www.kimballgroup.com/2004/03/differences-of-opinion/

compares THE two classic DWH architectures.

Read the paper and complete the table / questions on the next slide.

(Caution: The paper is biased / favors one approach; you may want to read other/more papers for a neutral view.)

Daimler TSS Data Warehouse / DHBW 93

Page 94: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: classical DWH architectures

How are the approaches

called?

Who “invented” the approach?

How many layers are used and how are

the layers called?

Which data modeling approaches are

used in which layer?

In which layer are atomic detail data

stored?

In which layer are aggregated / summary

data stored?

List at least 2 advantages

List at least 2 disadvantages

Page 95: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: classical DWH architectures

How are the

approaches called?

Kimball Bus Architecture Corporate Information Factory

Who “invented” the approach? • Ralph Kimball • Bill Inmon

How many layers are used and

how are the layers called?

• Data Staging

• Dimensional Data Warehouse

• Data Acquisition

• Normalized Data Warehouse

• Data Delivery / Dimensional Mart

Which data modeling approaches

are used in which layer?

• Data Staging: variable, corresponds to

source system

• Dimensional Data Warehouse:

Dimensional Model

• Data Acquisition: variable, corresponds to

source system

• Normalized Data Warehouse: 3NF

• Data Delivery: Dimensional Model

In which layer are atomic detail

data stored?

• Dimensional Data Warehouse • Normalized Data Warehouse

In which layer are aggregated /

summary data stored?

• Dimensional Data Warehouse • Data Delivery / Dimensional Mart

Daimler TSS Data Warehouse / DHBW 95

Page 96: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: classical DWH architectures

Kimball Bus Architecture Corporate Information Factory

Advantages • Two layers only mean faster development and

less work

• Rather simple approach to make data fast and

easily accessible

• Lower startup costs (but higher subsequent

development costs)

• Separation of concerns: long-term enterprise data

storage separated from data presentation

• Changes in requirements and scope are easier to

manage

• Lower subsequent development costs (but higher

startup costs)

Disadvantages • If table structures change (instable source

systems), high effort to implement the changes

and reload data, especially conformed

dimensions (“Dimensionitis” desease)

• Non-metric data not optimal for dimensional

model

• Dimensional model (esp. Star Schema) contains

data redundancy

• Data model transformations from 3NF to

Dimensional model required

• More complex as two different data models are

required

• Larger team(s) of specialists required

Daimler TSS Data Warehouse / DHBW 96

Page 97: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Inmon Corporate Information Factory

Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/Daimler TSS Data Warehouse / DHBW 97

Page 98: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Inmon Corporate Information Factory

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging Layer

(Input Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output Layer)

(Reporting Layer)

Metadata Management

Security

DWH Manager incl. Monitor

subject-

oriented,

integrated,

time-

variant,

non-volatile

Daimler TSS Data Warehouse / DHBW 98

Page 99: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Inmon Corporate Information Factory

• Top-down approach

• (Normalized) Core Warehouse is essential for subject-oriented, integrated, time-variant and nonvolatile data storage

• Create (departmental) Data Marts as subsets of Core Enterprise DWH as needed

• Data Marts can be designed with Dimensional model

• The standard DWH architectureis more general compared to CIF, but was mainly influenced by CIF

Daimler TSS Data Warehouse / DHBW 99

Page 100: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DW 2.0: The Architecture for the Next Generation of Data Warehousing

Source: W.H. Inmon, Dan Linstedt: Data Architecture: A Primer for the Data Scientist, Morgan Kaufmann, 2014, chapter 3.1

Operational application

data model

Integrated corporate

data model

Integrated corporate

data model

Archival

data model

Data

Lifecyc

le

Daimler TSS Data Warehouse / DHBW 100

Page 101: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DW 2.0: The Architecture for the Next Generation of Data Warehousing

Main characteristics:

• Structured and “unstructured” data, not just metrics

• Life Cycle of data with different storage areas

• Hot data: High speed, expensive storage (RAM, SSDs) for most recent data

• …

• Cold data: Low speed, inexpensive storage (e.g. hard disks) for old data; archival data model with high compression

• Metadata is an integral part of the DWH and not an afterthought

Daimler TSS Data Warehouse / DHBW 101

Page 102: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Operational data store (ODS)

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging Layer

(Input Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output Layer)

(Reporting Layer)

Metadata Management

Security

DWH Manager incl. Monitor

subject-

oriented,

integrated,

time-

variant,

non-volatile

Operational Data Store

Daimler TSS Data Warehouse / DHBW 102

Page 103: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Operational data store (ODS)

• ODS: Real-time/Right-time layer

• Replication techniques used to transport data from source database to ODS layer with minimal impact on source system

• Data in the ODS has no history and is stored without any cleansing and without any integration (1:1 copy from single source)

• DWH performance not optimal as data model is suited for OLTP and not for reporting requirements

• ODS normally additionally to Staging / Core DWH / Mart Layer but can exist alone without other layers

Daimler TSS Data Warehouse / DHBW 103

Page 104: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Logical architecture (Gartner)

• The LDW is a multi-server / multi-engine architecture

• The LDW can have multiple engines, the main ones being: The Data Warehouse(DW), data marts and the Data Lake

• We should not be trying to choose a single engine for all our requirements. Instead, we should be sensibly distributing all our requirements across the various components we have to choose from. Reasons:

• Regulatory constraints

• Organizational constraints

https://blogs.gartner.com/henry-cook/2018/01/18/logical-data-warehouse-project-plans/Daimler TSS Data Warehouse / DHBW 104

Page 105: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Specific Big Data architectures

• There exist well-known reference architectures for Data Warehouses

• Many tools and schema-on-read came with the Hadoop ecosystem

• Was a “black box” at the beginning

• Gets more and more structure with different layers instead of a “black box”

• Structure, modeling, organization, governance instead of tool-only focus

Daimler TSS Data Warehouse / DHBW 105

Page 106: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

How does a Data Lake differ from a data swamp?

• Naive idea: dump everything in (“landing zone”)

• Data hoarding is not a data management strategy

• A Data Lake brings in structure

• e.g. create directories in HDFS if Hadoop is used

• /transient

• /raw

• /standardized

• /use-case specific

Daimler TSS Data Warehouse / DHBW 106

Page 107: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What is a Data Lake according to Mark Madsen?separate collect / manage / delivery

Daimler TSS Data Warehouse / DHBW 107

Page 108: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Schema-on-write vs schema-on-read revisited

Old approach New approach

Model Collect

Collect Model

Analyze Analyze

Promotemove data into next

zone if required (data

has been analyzed and

is needed more often)

Daimler TSS Data Warehouse / DHBW 108

Page 109: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Vault 2.0 architecture – today’s world (Dan Linstedt)

Page 110: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Defining a Data Lake … by Dan Linstedt

https://www.youtube.com/watch?v=tDNjI1Yvqxw

Page 111: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Interview with “Father of data warehousing” Bill Inmon

https://www.youtube.com/watch?time_continue=48&v=b9YfsjEjVS8

Page 112: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake turned into Data swamp

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.12

Daimler TSS Data Warehouse / DHBW 112

Page 113: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.11

• No agreed definition

• Characteristics [Madsen]:

• Deals with data and schema

change easily

• Does not always require up front

modeling

• Does not limit the format or

structure of data

• Assumes a full range of data

latencies, from streaming to

one-time bulk loads, both in and

out including write-backDaimler TSS Data Warehouse / DHBW 113

Page 114: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Bimodal data governance

Two different styles of work: different zones with different degrees ofgovernance

• Stability & Quality

• Predictable, improving and renovating

• Flexibility & Speed

• Explanatory

Source: https://www.gartner.com/it-glossary/bimodal/

Daimler TSS Data Warehouse / DHBW 114

Page 115: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Exercise: recommend an architecture

A holding of 3 telecommunication companies

• Architecture: Virtual Data Warehouse

• + Companies may not want to provide their data to a new storage

• + Can easily be extended if new companies join the holding or reduced if a company leaves the holding

• - Bad performance

• - Not really data integration achieved, low Data Quality

• - Firewalls have to be opened

Daimler TSS Data Warehouse / DHBW 115

Page 116: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Where to use Hadoop and spark?

Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018

Page 117: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Where to use nosql?

Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018

Page 118: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Examples of Data Warehouses in the industry

Apple: multiple Petabytes

• Customer insights: who’s who and what are the customers up to

Walmart: 300TB (2003), several PB today

• It tells suppliers, “You have three feet of shelf space. Optimize it.”

eBay: >10PB, 100s of production DBs fed in

• Get better understanding of customers

Most DWHs are much smaller though. For huge and small DWHs: High challenges to architect + develop + maintain + run such complex systemshttps://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/ and http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

Daimler TSS Data Warehouse / DHBW 118

Page 119: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

•Architecture, conceptData Lake

•Tools (that can be used to implement a Lake)

Hadoop, Spark, Elastic Stack

Data Lake vs Hadoop

Daimler TSS Data Warehouse / DHBW 119

Page 120: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DWH and Data Lake

DWH on RDBMS

Slowly Changing Dimension

ELT vs ETL

3-Layer vs 2-Layer

Kimball Approach

Inmon Definition

Star Schema

Data Vault

Anchor Modeling

etc

Data Lake on Hadoop

Schema-on-Read

Agility

Parquet

Hive

HBase

SQL-on-Hadoop

Impala

Oozie

Zoekeeper

Methods,

Concepts,

Techniques

Tools,

Tools,

Tools

Daimler TSS Data Warehouse / DHBW 120

Page 121: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

The (ab)use of the words „Data Lake“technical

• A physical implementation of data solutions in the Hadoop ecosystem.

• A reservoir of curated, and often unconnected, datasets for data science, data exploration and reference data management use.

• The new Operational Data Store, a pub flap to hide a proper data management architecture.

• A name to indicate the staging area or the persistent staging area of a Data Warehouse solution.

• A technical solution to process huge volumes of data, for which an RDBMS is often too expensive to deploy.

Daimler TSSSource: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/Daimler TSS Data Warehouse / DHBW 121

Page 122: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

The (ab)use of the words „Data Lake“cynical reality

• A concept that will spray fairy dust upon data management issues that arise from organisational issues, politics or a lack of understanding what working with data requires of an organisation.

• A term to use when you are really out of your comfort zone as an architect and to shelter behind.

• A clever marketing and sales trick, used to sell you a “must have” solution next to what you already have. You are told that without a Data Lake you will be unable to do data science, digital transformation or business analytics.

Source: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/Daimler TSS Data Warehouse / DHBW 122

Page 123: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake (Martin fowler)

Source: https://martinfowler.com/bliki/DataLake.htmlDaimler TSS Data Warehouse / DHBW 123

Page 124: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake (Martin fowler)

Source: https://martinfowler.com/bliki/DataLake.htmlDaimler TSS Data Warehouse / DHBW 124

Page 125: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What is a Data Lake?

A Data Lake acquires data from multiple sources in an enterprise in its native form and may also have internal, modeled forms of this same data for various purposes. The information thus handled could be any type of information, ranging from structured or semi-structured data to completely unstructured data. A Data Lake is expected to be able to derive enterprise-relevant meanings and insights from this information using various analysis and machine learning algorithms.

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017

Daimler TSS Data Warehouse / DHBW 125

Page 126: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake life cycle

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017Daimler TSS Data Warehouse / DHBW 126

Page 127: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data Lake (eckerson group)

Source: https://www.eckerson.com/articles/ten-characteristics-of-a-modern-data-architecture

Daimler TSS Data Warehouse / DHBW 127

Page 128: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

New approach to build data systems

Immutable data

(Source of Truth /

Single version of

facts)

(Materialized) Views

on data

(Materialized) Views

on data

(Materialized) Views

on data(Materialized) Views

on data

(Materialized) Views

on data

Query = Application

HDFS / NoSQL

RDBMS

NewSQL(Materialized) Views

on data

(Materialized) Views

on data

(Materialized) Views

on data(Materialized) Views

on data

(Materialized) Views

on data

Query = Application

Daimler TSS Data Warehouse / DHBW 128

Page 129: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Lambda architecture

Source:

Batch Layer

Batch

Engine Serving Layer

Serving

Backend Queries

Raw history

data

Result

data

Real-Time Layer

Real-Time

Engine

Daimler TSS Data Warehouse / DHBW 129

Page 130: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Overview Lambda architecture

https://de.slideshare.net/gschmutz/big-data-and-fast-data-Lambda-architecture-in-actionDaimler TSS Data Warehouse / DHBW 130

Page 131: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Which tools could be used in the Lambda architecture? db requirements

Easier to implement for

database vendors

compared to random

access

Daimler TSS Data Warehouse / DHBW 131

Page 132: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Batch layer

Write sequential once

Bulk sequential read many

times

Speed layer

Random write

Random read

Servinglayer

Batch write

Random read

Which tools could be used in the Lambda architecture? db requirements

More challenging

More challenging More challenging

Daimler TSS Data Warehouse / DHBW 132

Page 133: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What are the three paradigms of programming?

• Request/Response

• Batch

• Stream processing

Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619Daimler TSS Data Warehouse / DHBW 133

Page 134: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Enterprise-wide architectureusing log-centric approach

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 134

Page 135: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Enterprise-wide architectureusing log-centric approach and CDC

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 135

Page 136: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Enterprise-wide architecture todaydata integration challenge

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 136

Page 137: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Enterprise-wide architecture todaydata integration challenge

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 137

Page 138: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Enterprise-wide architecture todaydata integration challenge

Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 138

Page 139: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Log centric architecture / Kappa architecture / data streaming platform

Source: Jay Kreps: I heart logs, O’Reilly 2014Daimler TSS Data Warehouse / DHBW 139

Page 140: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What is stream processing?

Most common understanding (doing the „T“ in ETL):

• Stream processing is the parallel processing of data in motion = computing on data directly as it is produced or received.

• Not necessarily transient, approximate, lossy (assumptions from Lambda architecture and other event processing systems)

Daimler TSS Data Warehouse / DHBW 140

Page 141: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Messaging systems @Linkedin

1st attempt

• Active MQ, RabbitMQ

• Problems:

• Not distributed

• Throughput

• Persistence

• Ordering

2nd attempt

• Kafka

• Key abstraction: Logs

• build from scratch

• Distributed system by design

• Partitioning with local ordering

• Elastic scaling

• Fault tolerance

Daimler TSS Data Warehouse / DHBW 141

Page 142: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kafka: a modern distributed system for streams

• Scalability

• Hundreds of MB/sec/server throughput

• Many TB per node

• Guarantees of a database• All messages strictly ordered (within a partition)

• All data persistent

• Distributed by default• Replication

• Partitioning

• Producers + consumers all fault tolerant and horizontally scalable

Daimler TSS Data Warehouse / DHBW 142

Page 143: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

ETL revisited with kafka connect and kafka streams

Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619Daimler TSS Data Warehouse / DHBW 143

Page 144: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

What are stream processing frameworks?

• Apache Spark streaming

• Apache Storm (twitter, Nathan Marz)

• Apache Samza (linkedin)

• Apache Flink

• Apache Kafka Streams

• Simple library

• Reprocessing

• No microbatch = everything is a stream

• Local state

• Key operations: filter, aggregate, join

Daimler TSS Data Warehouse / DHBW 144

Page 145: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kafka @Linkedin

Source: https://de.slideshare.net/JayKreps1/i-32858698Daimler TSS Data Warehouse / DHBW 145

Page 146: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Summary: Stream data platform using Kafkaunifying batch and stream processing

Source: https://de.slideshare.net/JayKreps1/i-32858698Daimler TSS Data Warehouse / DHBW 146

Page 147: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kafka and microservices

• A message-oriented implementation requires an efficient messaging backbone that facilitates the exchange of data in a reliable and secure way with the lowest latency possible.

• Creating small, self-contained, data-driven applications that meld streaming data and microservices together is a good practice to break down large problems and projects into approachable chunks, reduce risk, and deliver value faster.

• Think of combinations of data-processing applications with microservices to deliver specific features and insights from a data stream.

Daimler TSS Data Warehouse / DHBW 147

Page 148: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Kappa architecture

Source:

Real-Time Layer

Real-Time

Engine

Serving Layer

Serving

BackendData Queries

Raw history

data

Result

data

Daimler TSS Data Warehouse / DHBW 148

Page 149: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

PS-3C architecture

Data Library

Storage

Engine

3C Layer

Preparation

EngineData Queries

Serving Layer

Delivery

Engine

Raw history

data

Integrated

subjects

Result

data

Daimler TSS Data Warehouse / DHBW 149

Page 150: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

PS-3C architecture

• Architecture by Rogier Werschkull

• Store incoming data in Data Library Layer (Persistent staging = PS)

• Prepare data in a 3C layer for “Concept – Context – Connector”-model

• Concept + Connector can be virtualized on data in Data Library Layer

Daimler TSS Data Warehouse / DHBW 150

Page 151: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Polyglot warehouse

• Architecture by Joe Caserta

• Big Data Warehouse may live in one or more platforms on premises or in the cloud

• Hadoop only

• Hadoop + MPP or RDBMS

• Additionally NoSQL or Search

Daimler TSS Data Warehouse / DHBW 151

Page 152: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Polyglot warehouse

Source: https://www.slideshare.net/CasertaConcepts/Hadoop-and-your-data-warehouse

Daimler TSS Data Warehouse / DHBW 152

Page 153: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

The extended Data Warehouse architecture (xdw)The Enterprise Analytics Architecture

• Architecture by Claudia Imhoff

• combine the stability and reliability of the BI architectures while embracing new and innovative technologies and techniques

• 3 components that extend the EDW environment

• Investigative computing platform

• Data refinery

• Real-time (RT) analysis platform

Data Warehouse /

DHBWDaimler TSS153Daimler TSS Data Warehouse / DHBW 153

Page 154: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

The extended Data Warehouse architecture (xdw)The Enterprise Analytics Architecture

Source: https://upside.tdwi.org/articles/2016/03/15/extending-traditional-data-warehouse.aspx

Daimler TSS Data Warehouse / DHBW 154

Page 155: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Gartner Data Lake architecture styles

• Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place

• Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement.

• Data Science Lab: most suitable for data discovery and for developing new advanced analytics models

Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

Daimler TSS Data Warehouse / DHBW 155

Page 156: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Gartner Data Lake architecture styles

Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

Daimler TSS Data Warehouse / DHBW 156

Page 157: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Gartner – the logical DWH

https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/

https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/

Daimler TSS Data Warehouse / DHBW 157

Page 158: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Summary

Landing Area

Storage

Engine

Data Lake

Integration

EngineData Queries

Data Presentation

Delivery

Engine

Raw history

dataLightly

integrated data

Result

data

Daimler TSS Data Warehouse / DHBW 158

Page 159: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data core – Dave Wells

• Architecture by Eckerson Group

• DWHs exist together with MDM, ODS, and portions of the Data Lake as a collection of data that is curated, profiled, and trusted for enterprise reporting and analysis

https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Daimler TSS Data Warehouse / DHBW 159

Page 160: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data core – Dave Wells

https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Daimler TSS Data Warehouse / DHBW 160

Page 161: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Data lifecycle – Dave Wells

https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Daimler TSS Data Warehouse / DHBW 161

Page 162: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

DWH and Data Lake – Dave WellsIn parallel vs inside

https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Daimler TSS Data Warehouse / DHBW 162

Page 163: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Dimensional modeling in the age of Big Data

• Dimensional modeling is not dead.

• The benefits are still valid in the age of Big Data, Hadoop, Spark, etc:

• Eliminate joins

• Data model is understandable for end users

• Well-suited for columnar storage + processing (e.g. SIMD)

• Nesting technique

• E.g. tables with lower granularity can be nested into large fact table

• Usage in SQL: Flatten(kvgen(<json>))

Daimler TSS Data Warehouse / DHBW 163

Page 164: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Self-service data

Functional

area

Why important Self-service approach

Data

acceleration

With shadow analytics, users create

redundant data copies.

The system must be capable of autonomously

identifying the best optimizations and adapting to

emerging query patterns over time.

Data

catalog

Data consumers struggle to find data that

is important to their work. Users keep

private notes about data sources and data

quality, meaning there is no governance.

In the self-service approach, the catalog is

automatic—as new data sources are brought online.

Data

virtualizatio

n

It is virtually impossible for an organization

to centralize all data in a single system.

Data consumers need to be able to access all data

sets equally well, regardless of the underlying

technology or location of the system.

https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Daimler TSS Data Warehouse / DHBW 164

Page 165: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Self-service data

Functional

area

Why important Self-service approach

Data

curation

There is no single “shape” of data that

works for everyone.

Data consumers need the ability to interact with

data sets from the context of the data itself, not

exclusively from simple metadata that fails to tell the

whole story. Data consumers should be capable of

reshaping data to their own needs without writing

any code or learning new languages.

Data lineage As data is accessed by data consumers

and in different processes, it is important

to track the provenance of the data, who

accessed the data, how the data was

accessed, what tools were used, and what

results were obtained.

As users reshape and share data sets with one

another through a virtual context, a self-service data

platform can seamlessly track these actions and all

states of data along the way, providing full audit

capabilities as well.

https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Daimler TSS Data Warehouse / DHBW 165

Page 166: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Self-service data

Functional

area

Why important Self-service approach

Open

source

Because data is essential to every area of

every business, the underlying data

formats and technologies used to access

and process the data should be open

source

Self-service data platforms build on open source

standards like Apache Parquet, Apache Arrow, and

Apache Calcite to store, query, and analyze data

from any source.

Security

controls

Organizations safeguard their data assets

with security controls that govern

authentication (you are who you say you

are), authorization (you can perform

specific actions), auditing (a record of the

actions you take), and encryption (you can

only read the data if you have the right

key).

Self-service data platforms integrate with existing

security controls of the organization, such as LDAP

and Kerberos.

https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Daimler TSS Data Warehouse / DHBW 166

Page 167: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Is the DWH dead?

Hadoop (and for similar reasons Spark) has its strengths but no as a DWH replacement, e.g.

• Fast query reads only possible in HBase with an inflexible (use case specific) data model

• No sophisticated query optimizer

• Hadoop is very complex with many tools/versions/vendors and no standard

• Security is still at the beginning

Daimler TSS Data Warehouse / DHBW 167

Page 168: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Digitization is the process of converting information into a digital (i.e. computer-readable) format

Digitization – the digital data explosion

Source: https://slideplayer.com/slide/10254426/

Daimler TSS Data Warehouse / DHBW 168

Page 169: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Future mobility

https://blog.daimler.com/en/2018/07/19/future-mobility-metropolitan-cities-congress/Daimler TSS Data Warehouse / DHBW 169

Page 170: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Display video

Is china the next silicon valley?

Daimler TSS Data Warehouse / DHBW 170

Page 171: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Some statisticsbat – baidu + alibaba + tencent

Sources:

https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/

https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdlDaimler TSS Data Warehouse / DHBW 171

Page 172: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Volume

Velocity

Variety

Veracity

Value

Digitization – challenges & opportunities

Ethics

AI

Impact

IoT, Industry 4.0, …

Data integration

Real-time

Data quality

Data-driven

Digital Services

Daimler TSS Data Warehouse / DHBW 172

Page 173: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Digitization public authorities

Daimler TSS Data Warehouse / DHBW 173

Page 174: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

Nathan Marz view on Schema

• RawnessStore the data as it is. No transformations.

• ImmutabilityDon’t update or delete data, just add more.

• Graph-like schema recommended

„Many developers go down the path of

writing their raw data in a schemaless

format like JSON. This is appealing because

of how easy it is to get started, but this

approach quickly leads to problems.

Whether due to bugs or misunderstandings

between different developers, data

corruption inevitably occurs“

(see page 103, Nathan Marz, „Big Data:

Principles and best practices of scalable

realtime data systems", Manning

Publications)

Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015

Daimler TSS Data Warehouse / DHBW 174

Page 175: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

sourc

e: M

ark

us

Sch

mid

berg

er

-B

ig D

ata

ist

tot

–E

sle

be

Busi

ness

In

tellig

enz?

, TD

WI 2

01

6

Daimler TSS Data Warehouse / DHBW 175

Page 176: Lecture @DHBW: Data Warehousebuckenhofer/20201DWH/Buckenhofer-… · Kimball bus architecture (Central Data Warehouse based on data marts) Data Warehouse Backend Frontend External

sourc

e: M

ark

us

Sch

mid

berg

er

-B

ig D

ata

ist

tot

–E

sle

be

Busi

ness

In

tellig

enz?

, TD

WI 2

01

6

Lambda @glomex

• Enrich batch-

driven data

processing with

real-time

requirements

• Adapt Lambda

architecture to

own requirements

Remark:

AWS Lambda # Lambda architecture