142
Ein Unternehmen der Daimler AG Lecture @DHBW: Data Warehouse 06 Data Catalog, Data Security, Frontend Andreas Buckenhofer

Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Ein Unternehmen der Daimler AG

Lecture @DHBW: Data Warehouse

06 Data Catalog, Data Security, Frontend

Andreas Buckenhofer

Page 2: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com

Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision

Andreas BuckenhoferSenior DB Professional

Since 2009 at Daimler TSS

Department: Machine Learning Solutions

Business Unit: AnalyticsDHBWDOAG

xing

Contact/Connect

vcard

• Oracle ACE Associate

• DOAG responsible for InMemory DB

• Lecturer at DHBW

• Certified Data Vault Practitioner 2.0

• Certified Oracle Professional

• Certified IBM Big Data Architect

• Over 20 years experience with

database technologies

• Over 20 years experience with Data

Warehousing

• International project experience

Page 3: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 3

Change Log

Date Changes

14.11.2019 Initial version

Page 4: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 4

What you will learn today

After the end of this lecture you will be able to

• Explain the concepts behind a modern metadata management

• Data Catalog

• Have an overview of frontend requirements

• Information Design

• Understand importance of Data Security and Data Ethics

• Data Classification

• GDPR

• Risks

• Understand necessity for Data Culture

Page 5: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Warehouse /

DHBWDaimler TSS 5

Data Catalog

Page 6: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Standard Data Warehouse architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 6

Page 7: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Making it easier to discover datasetsavailable since 05-sep-2018

Source: Google announcement https://www.blog.google/products/search/making-it-easier-discover-datasets/

Daimler TSS Data Warehouse / DHBW 7

Page 8: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Making it easier to discover datasetsavailable since 05-sep-2018

Daimler TSS Data Warehouse / DHBW 8

Page 9: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Find the right data

• With data science and analytics on the rise and under way to being democratized, the importance of being able to find the right data to investigate hypotheses and derive insights is paramountSource: https://www.zdnet.com/article/google-can-now-search-for-datasets-first-research-then-the-world

• Google Dataset search helps to find external data

• Schema.org defines open metadata format; dataset itself may not be open/free

• Search engines can interpret the format

• Ranking of data

• Help users discover where the data is and user can access it directly from the source

What about internal data?

Daimler TSS Data Warehouse / DHBW 9

Page 10: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

What is metadata?

Data

about

other data

Daimler TSS Data Warehouse / DHBW 10

Page 11: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Types of metadata

Business Metadata

• Business terms

• Domain model

• Glossary, etc

Legal Metadata

• Sensitive data

• GDPR

• Security

classification

Profiling

• Density

• Cardinality

• Min, Max, Median

• Popularity / Ranking

Technical Metadata

• Schema

• Table

• Column

• etc

Tagging / Linkage

Critical parts

Page 12: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Does Metadata Management provide answers to such questions across the whole workflow?

Search for data Work with data

Find Understand Trust Access Write

How to get access

to the data?

What tables are

important?

What table contains

production dates?

What is the difference

between production_date

and prod_dt?

How is this

column

calculated?

How to join the

tables?

Is FIN unique?

Who knows about

the data?

Is the data

reliable?

Daimler TSS Data Warehouse / DHBW 12

Page 13: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Cataloging source systemsMany formats = many connectors

• RDBMS (Oracle, Db2, SQL Server, Teradata, …)

• Hadoop (HDFS, Hive, …; on-premises, Cloud)

• NoSQL DBs

• Files (Excel, csv, …)

• Powerdesigner, Erwin, and other data modeling tools

Daimler TSS Data Warehouse / DHBW 13

Page 14: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Metadata differences: both tables contain same information, but different metadata

Vehicle_id Engine_id Cab_id Axle_id

WDB123 1234 ABCD XY12

Vehicle_id Type Id

WDB123 Engine 1234

WDB123 Cab ABCD

WDB123 Axle XY12

Daimler TSS Data Warehouse / DHBW 14

Page 15: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Metadata import used to be simple with RDBMS

Where is the data and

where is the

metadata in this

logfile?

Data Lake:

decentralized control

of the data

Daimler TSS Data Warehouse / DHBW 15

Page 16: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Lake / Hadoop

• Easy approach: Access Hive Metastore and import metadata

• Prerequisite: all data/files in HDFS require Hive access

• But unrealistic prerequisite

• Many logs are just dumped into the file system

• Interpreting ALL files by Catalog SW unrealistic, too.

• Huge computing power

• Huge number of variations (Cloud, on-premises, SW versions) lacks support of vendors for Catalog SW

• Sources should deliver metadata

Daimler TSS Data Warehouse / DHBW 16

Page 17: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Cataloging @google

Source: https://ai.google/research/pubs/pub45390

Heavy usage of

Automation

and

Machine Learning

Daimler TSS Data Warehouse / DHBW 17

Page 18: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Cataloging at Netflix, Twitter, Linkedin, etc.

Company Link

Netflix (Metacat)https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520

https://github.com/Netflix/metacat

Twitterhttps://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html

LinkedIn (WhereHows)https://github.com/linkedin/WhereHows

https://github.com/linkedin/WhereHows/wiki

Google (Goods)https://ai.google/research/pubs/pub45390

https://www.buckenhofer.com/2016/10/goods-how-to-post-hoc-organize-the-data-lake/

Uberhttps://eng.uber.com/databook/

ebayhttps://www.ebayinc.com/stories/blogs/tech/bigdata-governance-hive-metastore-listener-for-apache-atlas-use-cases/

Lyfthttps://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

Daimler TSS Data Warehouse / DHBW 18

Page 19: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Cataloging @uber

Source: https://eng.uber.com/databook/

Daimler TSS Data Warehouse / DHBW 19

Page 20: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Cataloging @twitter

Source: https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.htmlDaimler TSS Data Warehouse / DHBW 20

Page 21: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Cataloging @linkedin (open source)

Source: https://github.com/LinkedIn/Wherehows/wiki

Daimler TSS Data Warehouse / DHBW 21

Page 22: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Catalogs are everywhere … Google, Amazon

USER EXPERIENCEINVENTORY

Daimler TSS Data Warehouse / DHBW 22

Page 23: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Inventory vs user experience

Suppliers provide inventory

•A Catalog should list everything that is actually available

Consumers require user experience

•A Catalog should provide data usage statistics, ratings, data samples, statistical profiles, lineage, lists of users and stewards, and tips on how the data should be interpreted

Daimler TSS Data Warehouse / DHBW 23

Page 24: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Automation, crowd knowledge, and experts

• limitation of permissions to a trusted group

• A trusted group documents few datasets very well

• But most of the metadata is not documented

• Failure of many past approaches

• Automation, crowd knowledge and experts required

• Automation to get a broad coverage and use existing information like query logs

• Crowd to increase broad coverage

• Experts to confirm or reject „guesses“

-> Combination of coverage and accuracy

Daimler TSS Data Warehouse / DHBW 24

Page 25: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Types of metadata

Business Metadata

• List of business

terms

• Glossary, etc

Legal Metadata

• Sensitive data

• GDPR

• Security

classification

Profiling

• Density

• Cardinality

• Min, Max, Median

• Popularity / Ranking

Technical Metadata

• Schema

• Table

• Column

• etc

Tagging / Linkage

Critical parts

Automation, crowd

knowledge and experts

required

Page 26: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Catalog – Amazon for information

Data Catalog

Technical Metadata

Business Metadata

Collective Intelligence

Expert Sourcing

Data Access

Governance

MachineLearning

Automation

Inventory

User experience

& enrichment

Daimler TSS Data Warehouse / DHBW 26

Page 27: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Catalog search

Daimler TSS Data Warehouse / DHBW 27

Page 28: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Is the data Catalog a “metadata management reloaded”?

Name it as you like, but there are some critical developments

• Automation, Collective intelligence and expert knowledge

• Enable crowd sourcing and get help from other users

• Help to understand quality of data and usage of datasets

• Rating of information

• Web application for search / collaboration and API to access metadata

• Governance and legal framework for e.g. GDPR scenarios

• Capture metadata for security and end-user data consumption

• Identify the owner of the dataset and get access to source data

Daimler TSS Data Warehouse / DHBW 28

Page 29: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Warehouse /

DHBWDaimler TSS 29

Frontend

Page 30: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Standard Data Warehouse architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 30

Page 31: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Visualization in the usual case of life

Daimler TSS Data Warehouse / DHBW 31

Page 32: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Russian Campaign of Napoleon

Source: https://de.wikipedia.org/wiki/Charles_Joseph_Minard

Daimler TSS Data Warehouse / DHBW 32

Page 33: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Mapping the 1854 London Cholera Outbreak

Source: https://www1.udel.edu/johnmack/frec682/cholera/

Daimler TSS Data Warehouse / DHBW 33

Page 34: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Mapping the 1854 London Cholera Outbreak

Daimler TSS Data Warehouse / DHBW 34

Page 35: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Excercise: visualize as much as possible

Umsatz in €

2014 2015 2016

Kanada 16.000 14.000 17.000

England 8.000 9.000 8.000

Frankreich 7.000 4.000 5.000

USA 60.000 85.000 90.000

Deutschland 4.000 10.000 15.000

Australien 10.000 8.000 15.000

Umsatz 105.000 130.000 150.000

Daimler TSS Data Warehouse / DHBW 35

Page 36: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Possible Solution 1

Daimler TSS Data Warehouse / DHBW 36

Page 37: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Possible Solution 2

Daimler TSS Data Warehouse / DHBW 37

Page 38: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Interface to the end user

• Reporting (Standard, ad-hoc)

• OLAP interactive analysis

• Dashboards, Scorecards

• Advanced Analytics / Data Mining / Text Mining

• Search & Discovery

Daimler TSS Data Warehouse / DHBW 38

Page 39: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Reporting (Standard, ad-hoc)

• Standard Reports

• Prepared static reports that can be executed at request by end users

• Are executed at the end of an ETL process and e.g. send by email to end users

• Normally based on fact tables and its dimensions

• Reports are often lists similar to Excel-Sheets but can also contain graphics (e.g. line charts)

• Ad-hoc Reports

• End users create their own reports („Self service“)

Daimler TSS Data Warehouse / DHBW 39

Page 40: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

OLAP interactive analysis

ROLAP / MOLAP Client Frontend

• Prepared cubes (multidimensional or relational fact tables)

• User can perform interactive analysis of data

• Rollup / drill-down

• Pivot

• Slicing

• Dicing

Daimler TSS Data Warehouse / DHBW 40

Page 41: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Dashboards, Scorecards

• „Progress reports“

• Provide an overall view of KPIs (Key Performance Indicators)

• Combination of several elements from Reporting and/or OLAP (e.g. line charts) into an overall view (like a „cockpit“)

• Dashboard is more focused on operational goals

• High-level overview what is happening

• Scorecard is more focused on strategic goals

• Plan a strategy and identify why something happens

Daimler TSS Data Warehouse / DHBW 41

Page 42: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Software is eating the world

Machine learning will eat software• For many tasks, it’s easier to collect the data than to explicitly write the

program, e.g. face recognition or chess/go

• On the other hand, data collection isn’t always easy, e.g. billing SW

Advanced Analytics / Data Mining / Text MiningThe future of software development?

Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development

Daimler TSS Data Warehouse / DHBW 42

Page 43: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Machine Learning will change SW development

• Google’s Jeff Dean has reported that 500 lines of TensorFlow code has replaced 500,000 lines of code in Google Translate

• Don’t understate the difficulty of training a neural network of any complexity, but neither should we underestimate the problem of managing and debugging a gigantic codebase

• The developer has to become a teacher, a curator of training data, and an analyst of results

Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development

Source: https://twitter.com/DynamicWebPaige/status/915326707107844097

Data Warehouse / DHBW 43Source: https://twitter.com/markmadsen/status/1194622452430712833?s=20

Page 44: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Search & Discovery

• Not just numerical data

• Analysis of new data types gets more and more important

• Text

• GPS coordinates

• Pictures

• Videos

• Data can be available in RDBMS (e.g. text modules/indexes available), Hadoop or SQL DBs

Daimler TSS Data Warehouse / DHBW 44

Page 45: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Many graphical elements to use in reports

Source: https://github.com/d3/d3/wiki/GalleryDaimler TSS Data Warehouse / DHBW 45

Page 46: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Source: Hichert / Faisst, http://www.backup-page.hichert.com/

Many graphical elements … chamber of horror

Daimler TSS Data Warehouse / DHBW 46

Page 47: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Many graphical elements … chamber of horror

Some remarks about previous slides

• 3D elements introduce clutter and give not more information

• Pie chart most often does not make sense

• Line chart barely readable

• Labels are placed outside of the graphic

• Tachometer costs a lot of space and show

• Too much color in general

• Color without meaning, e.g. red should be used for alarms / errors

Daimler TSS Data Warehouse / DHBW 47

Page 48: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Do you use 3D usually ?

Daimler TSS Data Warehouse / DHBW 48

Page 49: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

36 + 89 + 57 + 61 = 100

Source: BIKE magazine 8/2019

Daimler TSS Data Warehouse / DHBW 49

Page 50: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Did you know? Pizza is a real-time chart of how muchpizza is left

Daimler TSS Data Warehouse / DHBW 50

Page 51: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Story telling with appropriate visualization

Famous example by Hans Rosling (watch 3:08 onwards)

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=de

Daimler TSS Data Warehouse / DHBW 51

Page 52: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 52

InfoGraphics - Chart Guide

Source: https://chart.guide/poster

Page 53: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 53

Top infographics - The Forces Shaping the Future of the Global Economy

Source: https://www.visualcapitalist.com/our-top-infographics-of-2018/

Page 54: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 54

Top infographics - The Forces Shaping the Future of the Global Economy: details

Source: https://www.visualcapitalist.com/the-8-major-forces-shaping-the-future-of-the-global-economy/

Page 55: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Information Design

• Information design is the practice of presenting information in a way that fosters efficient and effective understanding of it.(source: Wikipedia, https://en.wikipedia.org/wiki/Information_design )

• Some authors are well known for their criticism of many graphical representations - they provide rules for good information design

• Edward Tufte

• Stephen Few

• Rolf Hichert

Daimler TSS Data Warehouse / DHBW 55

Page 56: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Which productgroup has the highest win in june?

Daimler TSS Data Warehouse / DHBW 56

Page 57: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Which productgroup has the highest win in june?Eye tracking

Daimler TSS Data Warehouse / DHBW 57

Page 58: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Which productgroup has the highest win in june?Improved version

Daimler TSS Data Warehouse / DHBW 58

Page 59: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Which productgroup has the highest win in june?Eye tracking

Daimler TSS Data Warehouse / DHBW 59

Page 60: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Information DesignReduce to the essentials

Define standards, e.g.

• use always the same colors and with care, e.g.

• red = negative

• green = positive

• pie charts are rarely useful and should be avoided

• better use bar chart or line chart

• No 3D elements as these elements don’t enhance information but introduce clutter

• Standardize abbreviations, e.g. PY = previous year

Daimler TSS Data Warehouse / DHBW 60

Page 61: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Eye-tracking - before and after

Daimler TSS Data Warehouse / DHBW 61

Page 62: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Table with integrated bar charts

Source: Hichert, http://www.hichert.com/de/resource/table-template-02/

Daimler TSS Data Warehouse / DHBW 62

Page 63: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

BI end user roles

Consumers / BI Users

• use reports, OLAP and dashboards to obtain information

Power Users

• Use reports , OLAP and dashboards to obtain information

• Create new reports and dashboards

Data Scientists

• Statistical / mathematical geeks

• Analyze / explore data

• Need to analyze raw (non-cleansed, non-transformed) data

Daimler TSS Data Warehouse / DHBW 63

Page 64: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 64

Summary

• Infographics

• Comprehensive graphics

• Storytelling

• Information Design

• Use information efficiently and effectively

Page 65: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Warehouse /

DHBWDaimler TSS 65

Data Security

Page 66: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Standard Data Warehouse architecture

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Daimler TSS Data Warehouse / DHBW 66

Page 67: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

GDPR – general data protection regulation

Regulation in EU law on data protection and privacy for all individuals in the EU

Requirements like

• Data protection by design and by default

• Right to erasure / Right to be forgotten

• Right to data portability

• Records of processing activities

Daimler TSS Data Warehouse / DHBW 67

Page 68: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Challenges DWH & Big Data

• "capture-it-all" approach causes serious questions of privacy

• always ask yourself why you are capturing or storing data

Source: https://martinfowler.com/bliki/Datensparsamkeit.html

Challenges like

• Combining of sensitive data

sources allowed?

• Export restrictions that

forbid to combine data from

Germany, US, China

Daimler TSS Data Warehouse / DHBW 68

Page 69: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 69

Data processing principles (GDPR Art. 5.1)

Principle Description

Lawfulness, fairness and transparency

(Rechtmäßigkeit, Verarbeitung nach Treu

und Glauben, Transparenz)

Processing must be according to law and

comprehensible for individuals

Purpose limitation (Zweckbindung) Use data for defined purpose only

Data minimisation (Datenminimierung) Data usage must be adequate

Accuracy (Richtigkeit) Personal Data needs to be correct

Storage limitation (Speicherbegrenzung) Keep data no longer than necessary

Integrity and confidentiality (Integrität und

Vertraulichkeit)

Protection against unauthorised or unlawful

processing and against accidental loss,

destruction or damage

Page 70: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Pseudonymization vs Anonymization

Personal data

Pseudonymized data are

data that allow re-

identification

Anonymized data are data

where the data subject is

no longer identifiable

Pseudonymization (e.g. Hashing)

Anonymization

Re-Identification with additional informationen

Name

VIN

IP-Adress

Home town

Telematic data

Daimler TSS Data Warehouse / DHBW 70

Page 71: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Anonymization by deleting data

Name Email Gender Birthdate ZIP Salary

Peter Miller [email protected] M 15.05.2001 89075 80.000

Martin Bush [email protected] F 22.07.1967 70079 85.000

Susan Dill [email protected] F 03.11.1978 60067 90.000

Daimler TSS Data Warehouse / DHBW 71

Page 72: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Anonymization by deleting data

Name Email Gender Birthdate ZIP Salary

M 15.05.2001 89075 80.000

F 22.07.1967 70079 85.000

F 03.11.1978 60067 90.000

Daimler TSS Data Warehouse / DHBW 72

People in the US can be identified in 87% by

gender, birthdate and zip (Latanya Sweeney)

https://dataprivacylab.org/projects/identifiability/paper1.pdf

Risiko Test: https://cpg.doc.ic.ac.uk/individual-risk/

Page 73: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Facebook Data Leak

• Personality-App

• Data from 3 Mio users

• Data were anonymized according to Facebook

• Gender, birthdate, and zip ↯ ↯ ↯

• Results from personality tests

• (Additionally, username and password were available to access the data)

Source: https://www.newscientist.com/article/2168713-huge-new-facebook-data-leak-exposed-intimate-details-of-3m-users/

Daimler TSS Data Warehouse / DHBW 73

Page 74: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Fine for false anonymization

Quelle: https://edpb.europa.eu/news/national-news/2019/danish-data-protection-agency-proposes-dkk-12-million-fine-danish-taxi_en

Daimler TSS Data Warehouse / DHBW 74

Page 75: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

CIA triad – data classification

Confidentiality

IntegrityAvailability

whether or not

information is kept

secret or private,

e.g. data theft

whether the

information is kept

accurate, e.g.

faking data

ensuring that

information is

available when it is

needed, e.g.

blackout

Daimler TSS Data Warehouse / DHBW 75

Page 76: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Typical security functionalities and measures in databases

Auditing

Encryption

Masking (static, dynamic)

Row Level Security

Authentication & Password

policies

Patching

Secure installation

SSL – secure

communication

Authorization & roles,

profiles, ressource limits

Security checklists

Page 77: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS 77

Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Page 78: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Page 79: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Warehouse /

DHBWDaimler TSS 79

Data Culture

Page 80: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data quality and Meta Data management

Domain knowledge

Data culture

Digitization hot spotsBiMa-studie 2018 (BARC + sopra steria consulting)

Daimler TSS Data Warehouse / DHBW 80

Page 81: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Digitization - New giants „BAT“Baidu (Google) + Alibaba (Amazon) + Tencent (Facebook)

Sources:

https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/

https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl

Daimler TSS Data Warehouse / DHBW 81

Page 82: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Stephen Few: Big Data, Big Dupe (Analytics Press, 2018)

• “Taping into the potential of data involves data sensemaking. I prefer the term over the more popular term analytics because it better fits the full range of activities that are needed” (page 44)

• “Data sensemaking requires a significant investment in the development of human skills”(page 74)

• DWH/Big Data Architecture (instead of technology/tools)

• Data integration, Data visualization, Data modeling, …

Daimler TSS Data Warehouse / DHBW 82

Page 83: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data landscape manifesto@scout24

Source: Data Festival, Munich 2019

Daimler TSS Data Warehouse / DHBW 83

Page 84: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Netflix’ three credos for data culture

Source: Phil Simons’ webinar on “The visual organization: data visualization, big data, and the quest for better decisions” from

Harvard Business Review (2014)

Data catalog: Data should be accessible, easy to discover, and easy to process for everyone

Data storytelling: Whether your dataset is large or small, being able to visualize it makes it easier to explain

Fail fast and iterate: The longer you take to find the data, the less valuable it becomes

Daimler TSS Data Warehouse / DHBW 84

Page 85: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

From DevOps to DataOps

A collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization.

The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses

technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.

Source: https://medium.com/data-ops/the-best-dataops-articles-of-q3-2018-c39882be3d7b

Daimler TSS Data Warehouse / DHBW 85

Page 86: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Culture

Establish a culture for

• Sharing data across the organization by following privacy and ethics

• Collaborating around data products and data platforms

• Data-driven decisions instead of e.g. experience or best paid employee in the room

• Speed: fail fast and iterate

Being successful as a data-driven company requires the active involvement of all employees

Daimler TSS Data Warehouse / DHBW 86

Page 87: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

We’re entering a new world in which

data may be more important than

software.

[Tim O’Reilly, Founder O’Reilly Media]

Data is a precious thing and will last longer than

the systems themselves.

[Tim Berners-Lee, Father of the Worldwide Web]

Information is the oil of the 21st

century

[Peter Sondergaard, Gartner]

Everything we do in the digital realm ... creates a data trail.

And if that trail exists, chances are someone is using it.

[Douglas Rushkoff, Author]

Page 88: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data creation is exploding

[Gavin Belson, HBOs Silicon Valley]

Data is the new gold

[Open Data Initiative, European Commission]

In a world deluged by irrelevant

information, clarity is power.

[Yuval Noah Harari, Author]

Big data is not about the data

[Gary King, Harvard University]

Page 89: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data WarehousE

Data Warehouse /

DHBWDaimler TSS 89

Applications come, applications go.

The data, however, lives forever.

It is not about building applications;

it really is about the data underneath these applications

(Tom Kyte, Oracle)

Page 90: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com

Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle

© Daimler TSS I Template Revision

Page 91: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Visual Vocabulary

https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary

http://www.vizwiz.com/2018/07/visual-vocabulary.html

Daimler TSS Data Warehouse / DHBW 91

Page 92: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Learn how to replace code

• Machine learning will no doubt change software development in significant ways

• Software developers will put much more effort into data collection and preparation

• Developers will have to do more than just collect data; they’ll have to build data pipelines and the infrastructure to manage those pipelines. We’ve called this “Data Engineering”

• Data engineers will be responsible for maintaining the data pipeline: ingesting data, cleaning data, feature engineering, and model discovery

Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-developmentDaimler TSS Data Warehouse / DHBW 92

Page 93: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Types of metadata (1)

Business Metadata

• Definition of business vocabulary and relationships

• Definition of the value range

• Linkage to physical representation

Daimler TSS Data Warehouse / DHBW 93

Page 94: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Types of metadata (2)

Report and ETL metadata

• Report definitions

• Data sources

• Column definitions

• Computations

Logical and physical metadata of data model

• Table structure

• Definition of columns

• Relationships between tables and columns

• Dimension hierarchy Daimler TSS Data Warehouse / DHBW 94

Page 95: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Benefits of Metadata management

Source: Detlef Apel: Datenqualität erfolgreich steuern, dpunkt 2015, chapter 14

• Data Lineage and dependencies

• Generating and controlling DWH processes

• Improve SW development quality

• Increase comprehensibility of KPIs

Daimler TSS Data Warehouse / DHBW 95

Page 96: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Technical Metadata managementvery often not successful

Metadata Repository

OLTP-1

OLTP-2

Microservice-1Microservice-1

Microservice-1Microservice-1

DWH

Data Lake

Who enriches / tags

technical metadata

with

legal and business

relevant information???

Daimler TSS Data Warehouse / DHBW 96

Page 97: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data Catalog a hot topic

• New Data Catalog vendors are entering the market

• Established vendors rebrand and enrich their existing toolsDaimler TSS Data Warehouse / DHBW 97

Page 98: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Alation architecture

Not just an RDBMS for structured

metadata, but also storage engines for text

data

Daimler TSS Data Warehouse / DHBW 98

Page 99: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data scientist sexiest job of the 21st century?

Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-articleDaimler TSS Data Warehouse / DHBW 99

Page 100: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

SQL standard is evolving

Source: https://vimeo.com/289497563

Daimler TSS Data Warehouse / DHBW 100

Page 101: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rows

Write an SQL statement that computes the most recent data for each customer.

Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts

Customer_

key

Name Status Valid_from

1 Brown Single 01-MAY-2014

2 Bush Married 05-JAN-2015

1 Miller Married 15-DEC-2015

3 Stein 15-DEC-2015

3 Stein Single 18-DEC-2015

SIN1.sqlDaimler TSS Data Warehouse / DHBW 101

Page 102: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rows

Daimler TSS Data Warehouse / DHBW 102

Page 103: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rows

create table s_customer (customer_key integer NOT NULL

, cust_name varchar2(100) NOT NULL, status varchar2(10), valid_from date NOT NULL, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)

);

insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown', 'Single', to_date('01.05.2014', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush', 'Married', to_date('05.01.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller', 'Married', to_date('15.12.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein', NULL, to_date('15.12.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein', 'Single', to_date('18.12.2015', 'DD.MM.YYYY'));

commit;

Daimler TSS Data Warehouse / DHBW 103

Page 104: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 1: max-function

SELECT s.*

FROM S_CUSTOMER s

JOIN (SELECT i.customer_key,

max(i.valid_from) as max_valid_from

FROM S_CUSTOMER i

GROUP BY i.customer_key) b

ON s.customer_key = b.customer_key

AND s.valid_from = b.max_valid_from;

S2IN.sqlDaimler TSS Data Warehouse / DHBW 104

Page 105: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 2: exists

SELECT s.*

FROM S_CUSTOMER s

WHERE NOT EXISTS (SELECT 1

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key

AND s.valid_from < i.valid_from);

S2IN.sqlDaimler TSS Data Warehouse / DHBW 105

Page 106: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 3: max in correlated sub-select

SELECT s.*

FROM S_CUSTOMER s

WHERE s.valid_from = (SELECT MAX(i.valid_from)

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key);

S2IN.sqlDaimler TSS Data Warehouse / DHBW 106

Page 107: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 4: coalesce with sub-select

SELECT *

FROM (SELECT coalesce ((SELECT min (i.valid_from)

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key

AND s.valid_from < i.valid_from

), to_date ('31.12.9999', 'DD.MM.YYYY'))

as end_ts,

s.*

FROM S_CUSTOMER s)

WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');

S2IN.sqlDaimler TSS Data Warehouse / DHBW 107

Page 108: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 5: max-function and with-clause

WITH max_cust as (

SELECT i.customer_key,

max(i.valid_from) as max_valid_from

FROM S_CUSTOMER i

GROUP BY i.customer_key)

SELECT s.*

FROM S_CUSTOMER s

JOIN max_cust b ON s.customer_key = b.customer_key

AND s.valid_from = b.max_valid_from;

S2IN.sqlDaimler TSS Data Warehouse / DHBW 108

Page 109: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

partition data

compute functions over these partitions

Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]

return result

Exercise: compute most recent rowsSQL Analytic / Windowing functions

Daimler TSS Data Warehouse / DHBW 109

Page 110: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 6: Analytic / Windowing function lead

WITH lead_cust as (

SELECT lead (s.valid_from, 1) OVER (PARTITION BY

s.customer_key

ORDER BY s.valid_from ASC) as end_ts

, s.*

FROM s_customer s)

SELECT *

FROM lead_cust b

WHERE b.end_ts IS NULL;

S3IN.sqlDaimler TSS Data Warehouse / DHBW 110

Page 111: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise: compute most recent rowssolution 7: Analytic function row_number

WITH lead_cust as (

SELECT row_number() OVER(PARTITION BY s.customer_key

ORDER BY s.valid_from DESC) as rn

, s.*

FROM s_customer s)

SELECT *

FROM lead_cust b

WHERE b.rn = 1;

S3IN.sqlDaimler TSS Data Warehouse / DHBW 111

Page 112: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Max or Analytic / Windowing functions Which alternative would you recommend?

• Check execution plans, execution time including service + response time, resource usage for final decision

• Solutions with Analytic / Windowing do not need self-join and show better statistics compared to the other shown solutions

• Analytic / Windowing functions are very powerful

• Remark: Usage of with-clause in SQL statements is preferable compared to sub-selects as it improves readability, understandability, maintainability

Daimler TSS Data Warehouse / DHBW 112

Page 113: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Temporal data storage (Bitemporal data)

Daimler TSS Data Warehouse / DHBW 113

Page 114: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Temporal data storage (Bitemporal data)

10.09. 20.09. 30.09. 10.10.

Time

Price: 15EUR Price: 16EUR

New Price of 16EUR is

entered into the DB

Valid

Time

(20.09.)

Transaction

Time

(10.09.)

Daimler TSS Data Warehouse / DHBW 114

Page 115: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

• Time period when a fact is true in the real world

• The end user determines start and end date/time (or just a date/time for events)

Business validity:

Valid time

• Time period when a fact stored in the database is known

• ETL process determines start and end date/time

Technical validity:Transaction time

• Combines both Valid and Transaction TimeBitemporal data

Temporal data storage (Bitemporal data)Definition

Daimler TSS Data Warehouse / DHBW 115

Page 116: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Temporal data storage (Bitemporal data)SQL standard

• SQL standard SQL:2011

• But different implementations by RDBMSes like Oracle, Db2, SQL Server and others

• Different syntax!

• Different coverage of standard!

• Very useful for slowly changing dimensions type 2, but also for other purposes

Daimler TSS Data Warehouse / DHBW 116

Page 117: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Db2 Valid Time example

CREATE TABLE customer_address

( customerID INTEGER NOT NULL

, name VARCHAR(100)

, city VARCHAR(100)

, valid_start DATE NOT NULL

, valid_end DATE NOT NULL

, PERIOD BUSINESS_TIME(valid_start, valid_end)

, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );

Daimler TSS Data Warehouse / DHBW 117

Page 118: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Db2 Valid Time example

INSERT INTO customer_address VALUES

(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');

UPDATE customer_address FOR PORTION OF BUSINESS_TIME

FROM '22.05.2013' TO '31.12.2013'

SET city = 'San Diego' WHERE customerID = 1;

customerID Name City Valid_start Valid_end

1 Miller Seattle 01.01.2013 22.05.2013

1 Miller San Diego 22.05.2013 31.12.2013

Daimler TSS Data Warehouse / DHBW 118

Page 119: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Db2 Valid Time example

SELECT *

FROM customer_address

FOR BUSINESS_TIME AS OF '17.05.2013';

Daimler TSS Data Warehouse / DHBW 119

Page 120: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Db2 transaction Time example

CREATE TABLE customer_info(

customerId INTEGER NOT NULL,

comment VARCHAR(1000) NOT NULL,

sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,

sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,

PERIOD SYSTEM_TIME (sys_start, sys_end)

);

Daimler TSS Data Warehouse / DHBW 120

Page 121: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Db2 transaction Time example

Transaction on 15.10.2013:

INSERT INTO customer_info VALUES( 1, 'comment 1');

Transaction on 31.10.2013

UPDATE customer_address SET comment = 'comment 2'

WHERE customerID = 1;

CustomerI

d

comment Sys_start Sys_end

1 Comment

2

31.10.2013 31.12.2999

Daimler TSS Data Warehouse / DHBW 121

Page 122: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Db2 transaction Time example

SELECT *

FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';

Data comes from a history table:

Valid Time and Transaction Time can be combined = Bitemporal table

CustomerId comment Sys_start Sys_end

1 Comment 1 15.10.2013 31.10.2013

Daimler TSS Data Warehouse / DHBW 122

Page 123: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Indexing - why

• Very important performance improvement technique

• Good for many reads with high selectivity, write penalty

• B-trees most common

root

branch branch

leaf leaf leaf

Table

Daimler TSS Data Warehouse / DHBW 123

Page 124: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Indexing a star schema – which columns are candidates for an index?

• DBs index Primary Keys by default

• Dimension table columns that are regularly used in where clausesare candidates

• Maybe foreign Key columns in Fact table (see also later Star Transformation)

Daimler TSS Data Warehouse / DHBW 124

Page 125: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Star transformation

• Fact table has normally much more rows compared to dimension tables

• Common join techniques would need to join first dimension table with the fact table

• Alternative technique: evaluate all dimensions(cartesian join)

• Then join into fact table in last step

• Oracle uses Bitmap indexes on foreign key columns in fact tables to achieveStar Join; not supported by many DBs

Daimler TSS Data Warehouse / DHBW 125

Page 126: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Daimler TSS Data Warehouse / DHBW 126

Partitioning

Col1 Col2 Col3 col4

1 A AA AAA

2 B BB BBB

3 C CC CCC

Col1 Col2

1 A

2 B

3 C

Col3 col4

AA AAA

BB BBB

CC CCC

Col1 Col2 Col3 col4

3 C CC CCC

Col1 Col2 Col3 col4

1 A AA AAA

2 B BB BBB

Vertical partitioning Horizontal partitioning

Page 127: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Horizontal partitioning

• Very powerful feature in a DWH to reduce workload

• Split table into logical smaller tables

• Avoidance of full table scans

• How could a table be split?

• Introduction to (Oracle) partitioning: https://asktom.oracle.com/partitioning-for-developers.htm

Daimler TSS Data Warehouse / DHBW 127

Page 128: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Horizontal Partitioning – splitting options

• By range

• Most common

• Use date field like order data to partition table into months, days, etc

• By list

• Use field that has limited number of different values, e.g. split customer data by country if end users most likely select customers from within a country

• By hash

• Use a filed that most likely splits the data in evenly distributed chunks

Daimler TSS Data Warehouse / DHBW 128

Page 129: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Parallelism

• Statements are normally executed on one CPU

• Parallelism allows the DB to distribute the execution to several CPUs

• Powerful combination with partitioning

• Parallelism is limited by the number of CPUs: if parallelism is too high, performance will degrade

• Intra-query parallelism and inter-query parallelism

Daimler TSS Data Warehouse / DHBW 129

Page 130: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Compression

• Data compression + Index compression

• Store more data in a block/page = read more data during I/O

• If CPU resources are available, often a very powerful feature to improve performance

• Additionally reduce storage

• Additionally reduce backup time

Daimler TSS Data Warehouse / DHBW 130

Page 131: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Already covered in a previous lecture

• Relational columnar In-Memory DB

• Materialized Views / Query Tables

Daimler TSS Data Warehouse / DHBW 131

Page 132: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Exercise - recapture ETL and DB specifics

• Recapture ETL and DB specific topics

• Which topics do you remember, or do you find important?

• Write down 1-2 topics on stick-it cards.

Daimler TSS Data Warehouse / DHBW 132

Page 133: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Working with Alation

• Onboarding of data sources

• IT just creates cover and grants access

• Self-service: Done by application owners including source system connection

• No need for central password management

• Scales for onboarding of many systems

Daimler TSS Data Warehouse / DHBW 133

Page 134: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Schema and its tables

Daimler TSS Data Warehouse / DHBW 134

Page 135: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Table and its columns with sample data

Daimler TSS Data Warehouse / DHBW 135

Page 136: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Columns and relationships

Daimler TSS Data Warehouse / DHBW 136

Page 137: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Legal tagsGDPR and other regulations

Associate legal tags

• Articles 16-21

• Identify data

• Right to erasure

• Right to be forgotten

Daimler TSS Data Warehouse / DHBW 137

Page 138: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Central vs local Data Catalogs

Central data Catalog

• Integrated views

• Mammoth task

• No redundancy

Local data Catalogs (reality)

• Legal requirements

• Feasibility

• Tool support very weak

Data

Cata

logSource 1

Source 2

Source 3

Data

Cata

log

Source 1

Source 2

Source 3

Data

Cata

log

Daimler TSS Data Warehouse / DHBW 138

Page 139: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Summaryinformation is beautiful

Source: https://www.youtube.com/watch?v=hOex1iU57iw

Daimler TSS Data Warehouse / DHBW 139

Page 140: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Data access

• End users should only have access to the Data Marts in a DWH

• Grant privileges and roles to users

• Good practice:

• Grant privileges directly to a role; do not nest roles (grant role to role)

• Grant role to users

• Distinguish end users, administrators and technical users with different password policies

• Much more difficult to grant access in a Data Lake

• Tool maturity

• Data often unknown with schema-on-read

Daimler TSS Data Warehouse / DHBW 140

Page 141: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Evaluation criteria

Technical Metadata

Business Metadata incl. Glossary

Tagging (Linkage)

Collective Intelligence(Collaboration)

Search

Security

Source connectors

Data profiling

Data access

Lineage

API

Versioning

Architecture

Components

Prerequisites

LicencingDaimler TSS Data Warehouse / DHBW 141

Page 142: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On

Over 75%• of time is spent for

• say they least enjoy

DATA PREPARATION

Data Consumers