Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Ein Unternehmen der Daimler AG
Lecture @DHBW: Data Warehouse
06 Data Catalog, Data Security, Frontend
Andreas Buckenhofer
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle
© Daimler TSS I Template Revision
Andreas BuckenhoferSenior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: AnalyticsDHBWDOAG
Contact/Connect
vcard
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years experience with
database technologies
• Over 20 years experience with Data
Warehousing
• International project experience
Daimler TSS Data Warehouse / DHBW 3
Change Log
Date Changes
14.11.2019 Initial version
Daimler TSS Data Warehouse / DHBW 4
What you will learn today
After the end of this lecture you will be able to
• Explain the concepts behind a modern metadata management
• Data Catalog
• Have an overview of frontend requirements
• Information Design
• Understand importance of Data Security and Data Ethics
• Data Classification
• GDPR
• Risks
• Understand necessity for Data Culture
Data Warehouse /
DHBWDaimler TSS 5
Data Catalog
Standard Data Warehouse architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 6
Making it easier to discover datasetsavailable since 05-sep-2018
Source: Google announcement https://www.blog.google/products/search/making-it-easier-discover-datasets/
Daimler TSS Data Warehouse / DHBW 7
Making it easier to discover datasetsavailable since 05-sep-2018
Daimler TSS Data Warehouse / DHBW 8
Find the right data
• With data science and analytics on the rise and under way to being democratized, the importance of being able to find the right data to investigate hypotheses and derive insights is paramountSource: https://www.zdnet.com/article/google-can-now-search-for-datasets-first-research-then-the-world
• Google Dataset search helps to find external data
• Schema.org defines open metadata format; dataset itself may not be open/free
• Search engines can interpret the format
• Ranking of data
• Help users discover where the data is and user can access it directly from the source
What about internal data?
Daimler TSS Data Warehouse / DHBW 9
What is metadata?
Data
about
other data
Daimler TSS Data Warehouse / DHBW 10
Types of metadata
Business Metadata
• Business terms
• Domain model
• Glossary, etc
Legal Metadata
• Sensitive data
• GDPR
• Security
classification
Profiling
• Density
• Cardinality
• Min, Max, Median
• Popularity / Ranking
Technical Metadata
• Schema
• Table
• Column
• etc
Tagging / Linkage
Critical parts
Does Metadata Management provide answers to such questions across the whole workflow?
Search for data Work with data
Find Understand Trust Access Write
How to get access
to the data?
What tables are
important?
What table contains
production dates?
What is the difference
between production_date
and prod_dt?
How is this
column
calculated?
How to join the
tables?
Is FIN unique?
Who knows about
the data?
Is the data
reliable?
Daimler TSS Data Warehouse / DHBW 12
Cataloging source systemsMany formats = many connectors
• RDBMS (Oracle, Db2, SQL Server, Teradata, …)
• Hadoop (HDFS, Hive, …; on-premises, Cloud)
• NoSQL DBs
• Files (Excel, csv, …)
• Powerdesigner, Erwin, and other data modeling tools
Daimler TSS Data Warehouse / DHBW 13
Metadata differences: both tables contain same information, but different metadata
Vehicle_id Engine_id Cab_id Axle_id
WDB123 1234 ABCD XY12
Vehicle_id Type Id
WDB123 Engine 1234
WDB123 Cab ABCD
WDB123 Axle XY12
Daimler TSS Data Warehouse / DHBW 14
Metadata import used to be simple with RDBMS
Where is the data and
where is the
metadata in this
logfile?
Data Lake:
decentralized control
of the data
Daimler TSS Data Warehouse / DHBW 15
Data Lake / Hadoop
• Easy approach: Access Hive Metastore and import metadata
• Prerequisite: all data/files in HDFS require Hive access
• But unrealistic prerequisite
• Many logs are just dumped into the file system
• Interpreting ALL files by Catalog SW unrealistic, too.
• Huge computing power
• Huge number of variations (Cloud, on-premises, SW versions) lacks support of vendors for Catalog SW
• Sources should deliver metadata
Daimler TSS Data Warehouse / DHBW 16
Cataloging @google
Source: https://ai.google/research/pubs/pub45390
Heavy usage of
Automation
and
Machine Learning
Daimler TSS Data Warehouse / DHBW 17
Cataloging at Netflix, Twitter, Linkedin, etc.
Company Link
Netflix (Metacat)https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
https://github.com/Netflix/metacat
Twitterhttps://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
LinkedIn (WhereHows)https://github.com/linkedin/WhereHows
https://github.com/linkedin/WhereHows/wiki
Google (Goods)https://ai.google/research/pubs/pub45390
https://www.buckenhofer.com/2016/10/goods-how-to-post-hoc-organize-the-data-lake/
Uberhttps://eng.uber.com/databook/
ebayhttps://www.ebayinc.com/stories/blogs/tech/bigdata-governance-hive-metastore-listener-for-apache-atlas-use-cases/
Lyfthttps://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
Daimler TSS Data Warehouse / DHBW 18
Cataloging @uber
Source: https://eng.uber.com/databook/
Daimler TSS Data Warehouse / DHBW 19
Cataloging @twitter
Source: https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.htmlDaimler TSS Data Warehouse / DHBW 20
Cataloging @linkedin (open source)
Source: https://github.com/LinkedIn/Wherehows/wiki
Daimler TSS Data Warehouse / DHBW 21
Catalogs are everywhere … Google, Amazon
USER EXPERIENCEINVENTORY
Daimler TSS Data Warehouse / DHBW 22
Inventory vs user experience
Suppliers provide inventory
•A Catalog should list everything that is actually available
Consumers require user experience
•A Catalog should provide data usage statistics, ratings, data samples, statistical profiles, lineage, lists of users and stewards, and tips on how the data should be interpreted
Daimler TSS Data Warehouse / DHBW 23
Automation, crowd knowledge, and experts
• limitation of permissions to a trusted group
• A trusted group documents few datasets very well
• But most of the metadata is not documented
• Failure of many past approaches
• Automation, crowd knowledge and experts required
• Automation to get a broad coverage and use existing information like query logs
• Crowd to increase broad coverage
• Experts to confirm or reject „guesses“
-> Combination of coverage and accuracy
Daimler TSS Data Warehouse / DHBW 24
Types of metadata
Business Metadata
• List of business
terms
• Glossary, etc
Legal Metadata
• Sensitive data
• GDPR
• Security
classification
Profiling
• Density
• Cardinality
• Min, Max, Median
• Popularity / Ranking
Technical Metadata
• Schema
• Table
• Column
• etc
Tagging / Linkage
Critical parts
Automation, crowd
knowledge and experts
required
Data Catalog – Amazon for information
Data Catalog
Technical Metadata
Business Metadata
Collective Intelligence
Expert Sourcing
Data Access
Governance
MachineLearning
Automation
Inventory
User experience
& enrichment
Daimler TSS Data Warehouse / DHBW 26
Catalog search
Daimler TSS Data Warehouse / DHBW 27
Is the data Catalog a “metadata management reloaded”?
Name it as you like, but there are some critical developments
• Automation, Collective intelligence and expert knowledge
• Enable crowd sourcing and get help from other users
• Help to understand quality of data and usage of datasets
• Rating of information
• Web application for search / collaboration and API to access metadata
• Governance and legal framework for e.g. GDPR scenarios
• Capture metadata for security and end-user data consumption
• Identify the owner of the dataset and get access to source data
Daimler TSS Data Warehouse / DHBW 28
Data Warehouse /
DHBWDaimler TSS 29
Frontend
Standard Data Warehouse architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 30
Visualization in the usual case of life
Daimler TSS Data Warehouse / DHBW 31
Russian Campaign of Napoleon
Source: https://de.wikipedia.org/wiki/Charles_Joseph_Minard
Daimler TSS Data Warehouse / DHBW 32
Mapping the 1854 London Cholera Outbreak
Source: https://www1.udel.edu/johnmack/frec682/cholera/
Daimler TSS Data Warehouse / DHBW 33
Mapping the 1854 London Cholera Outbreak
Daimler TSS Data Warehouse / DHBW 34
Excercise: visualize as much as possible
Umsatz in €
2014 2015 2016
Kanada 16.000 14.000 17.000
England 8.000 9.000 8.000
Frankreich 7.000 4.000 5.000
USA 60.000 85.000 90.000
Deutschland 4.000 10.000 15.000
Australien 10.000 8.000 15.000
Umsatz 105.000 130.000 150.000
Daimler TSS Data Warehouse / DHBW 35
Possible Solution 1
Daimler TSS Data Warehouse / DHBW 36
Possible Solution 2
Daimler TSS Data Warehouse / DHBW 37
Interface to the end user
• Reporting (Standard, ad-hoc)
• OLAP interactive analysis
• Dashboards, Scorecards
• Advanced Analytics / Data Mining / Text Mining
• Search & Discovery
Daimler TSS Data Warehouse / DHBW 38
Reporting (Standard, ad-hoc)
• Standard Reports
• Prepared static reports that can be executed at request by end users
• Are executed at the end of an ETL process and e.g. send by email to end users
• Normally based on fact tables and its dimensions
• Reports are often lists similar to Excel-Sheets but can also contain graphics (e.g. line charts)
• Ad-hoc Reports
• End users create their own reports („Self service“)
Daimler TSS Data Warehouse / DHBW 39
OLAP interactive analysis
ROLAP / MOLAP Client Frontend
• Prepared cubes (multidimensional or relational fact tables)
• User can perform interactive analysis of data
• Rollup / drill-down
• Pivot
• Slicing
• Dicing
Daimler TSS Data Warehouse / DHBW 40
Dashboards, Scorecards
• „Progress reports“
• Provide an overall view of KPIs (Key Performance Indicators)
• Combination of several elements from Reporting and/or OLAP (e.g. line charts) into an overall view (like a „cockpit“)
• Dashboard is more focused on operational goals
• High-level overview what is happening
• Scorecard is more focused on strategic goals
• Plan a strategy and identify why something happens
Daimler TSS Data Warehouse / DHBW 41
Software is eating the world
Machine learning will eat software• For many tasks, it’s easier to collect the data than to explicitly write the
program, e.g. face recognition or chess/go
• On the other hand, data collection isn’t always easy, e.g. billing SW
Advanced Analytics / Data Mining / Text MiningThe future of software development?
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development
Daimler TSS Data Warehouse / DHBW 42
Machine Learning will change SW development
• Google’s Jeff Dean has reported that 500 lines of TensorFlow code has replaced 500,000 lines of code in Google Translate
• Don’t understate the difficulty of training a neural network of any complexity, but neither should we underestimate the problem of managing and debugging a gigantic codebase
• The developer has to become a teacher, a curator of training data, and an analyst of results
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development
Source: https://twitter.com/DynamicWebPaige/status/915326707107844097
Data Warehouse / DHBW 43Source: https://twitter.com/markmadsen/status/1194622452430712833?s=20
Search & Discovery
• Not just numerical data
• Analysis of new data types gets more and more important
• Text
• GPS coordinates
• Pictures
• Videos
• Data can be available in RDBMS (e.g. text modules/indexes available), Hadoop or SQL DBs
Daimler TSS Data Warehouse / DHBW 44
Many graphical elements to use in reports
Source: https://github.com/d3/d3/wiki/GalleryDaimler TSS Data Warehouse / DHBW 45
Source: Hichert / Faisst, http://www.backup-page.hichert.com/
Many graphical elements … chamber of horror
Daimler TSS Data Warehouse / DHBW 46
Many graphical elements … chamber of horror
Some remarks about previous slides
• 3D elements introduce clutter and give not more information
• Pie chart most often does not make sense
• Line chart barely readable
• Labels are placed outside of the graphic
• Tachometer costs a lot of space and show
• Too much color in general
• Color without meaning, e.g. red should be used for alarms / errors
Daimler TSS Data Warehouse / DHBW 47
Do you use 3D usually ?
Daimler TSS Data Warehouse / DHBW 48
36 + 89 + 57 + 61 = 100
Source: BIKE magazine 8/2019
Daimler TSS Data Warehouse / DHBW 49
Did you know? Pizza is a real-time chart of how muchpizza is left
Daimler TSS Data Warehouse / DHBW 50
Story telling with appropriate visualization
Famous example by Hans Rosling (watch 3:08 onwards)
https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=de
Daimler TSS Data Warehouse / DHBW 51
Daimler TSS Data Warehouse / DHBW 52
InfoGraphics - Chart Guide
Source: https://chart.guide/poster
Daimler TSS Data Warehouse / DHBW 53
Top infographics - The Forces Shaping the Future of the Global Economy
Source: https://www.visualcapitalist.com/our-top-infographics-of-2018/
Daimler TSS Data Warehouse / DHBW 54
Top infographics - The Forces Shaping the Future of the Global Economy: details
Source: https://www.visualcapitalist.com/the-8-major-forces-shaping-the-future-of-the-global-economy/
Information Design
• Information design is the practice of presenting information in a way that fosters efficient and effective understanding of it.(source: Wikipedia, https://en.wikipedia.org/wiki/Information_design )
• Some authors are well known for their criticism of many graphical representations - they provide rules for good information design
• Edward Tufte
• Stephen Few
• Rolf Hichert
Daimler TSS Data Warehouse / DHBW 55
Which productgroup has the highest win in june?
Daimler TSS Data Warehouse / DHBW 56
Which productgroup has the highest win in june?Eye tracking
Daimler TSS Data Warehouse / DHBW 57
Which productgroup has the highest win in june?Improved version
Daimler TSS Data Warehouse / DHBW 58
Which productgroup has the highest win in june?Eye tracking
Daimler TSS Data Warehouse / DHBW 59
Information DesignReduce to the essentials
Define standards, e.g.
• use always the same colors and with care, e.g.
• red = negative
• green = positive
• pie charts are rarely useful and should be avoided
• better use bar chart or line chart
• No 3D elements as these elements don’t enhance information but introduce clutter
• Standardize abbreviations, e.g. PY = previous year
Daimler TSS Data Warehouse / DHBW 60
Eye-tracking - before and after
Daimler TSS Data Warehouse / DHBW 61
Table with integrated bar charts
Source: Hichert, http://www.hichert.com/de/resource/table-template-02/
Daimler TSS Data Warehouse / DHBW 62
BI end user roles
Consumers / BI Users
• use reports, OLAP and dashboards to obtain information
Power Users
• Use reports , OLAP and dashboards to obtain information
• Create new reports and dashboards
Data Scientists
• Statistical / mathematical geeks
• Analyze / explore data
• Need to analyze raw (non-cleansed, non-transformed) data
Daimler TSS Data Warehouse / DHBW 63
Daimler TSS Data Warehouse / DHBW 64
Summary
• Infographics
• Comprehensive graphics
• Storytelling
• Information Design
• Use information efficiently and effectively
Data Warehouse /
DHBWDaimler TSS 65
Data Security
Standard Data Warehouse architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 66
GDPR – general data protection regulation
Regulation in EU law on data protection and privacy for all individuals in the EU
Requirements like
• Data protection by design and by default
• Right to erasure / Right to be forgotten
• Right to data portability
• Records of processing activities
Daimler TSS Data Warehouse / DHBW 67
Challenges DWH & Big Data
• "capture-it-all" approach causes serious questions of privacy
• always ask yourself why you are capturing or storing data
Source: https://martinfowler.com/bliki/Datensparsamkeit.html
Challenges like
• Combining of sensitive data
sources allowed?
• Export restrictions that
forbid to combine data from
Germany, US, China
Daimler TSS Data Warehouse / DHBW 68
Daimler TSS Data Warehouse / DHBW 69
Data processing principles (GDPR Art. 5.1)
Principle Description
Lawfulness, fairness and transparency
(Rechtmäßigkeit, Verarbeitung nach Treu
und Glauben, Transparenz)
Processing must be according to law and
comprehensible for individuals
Purpose limitation (Zweckbindung) Use data for defined purpose only
Data minimisation (Datenminimierung) Data usage must be adequate
Accuracy (Richtigkeit) Personal Data needs to be correct
Storage limitation (Speicherbegrenzung) Keep data no longer than necessary
Integrity and confidentiality (Integrität und
Vertraulichkeit)
Protection against unauthorised or unlawful
processing and against accidental loss,
destruction or damage
Pseudonymization vs Anonymization
Personal data
Pseudonymized data are
data that allow re-
identification
Anonymized data are data
where the data subject is
no longer identifiable
Pseudonymization (e.g. Hashing)
Anonymization
Re-Identification with additional informationen
Name
VIN
IP-Adress
Home town
Telematic data
Daimler TSS Data Warehouse / DHBW 70
Anonymization by deleting data
Name Email Gender Birthdate ZIP Salary
Peter Miller [email protected] M 15.05.2001 89075 80.000
Martin Bush [email protected] F 22.07.1967 70079 85.000
Susan Dill [email protected] F 03.11.1978 60067 90.000
Daimler TSS Data Warehouse / DHBW 71
Anonymization by deleting data
Name Email Gender Birthdate ZIP Salary
M 15.05.2001 89075 80.000
F 22.07.1967 70079 85.000
F 03.11.1978 60067 90.000
Daimler TSS Data Warehouse / DHBW 72
People in the US can be identified in 87% by
gender, birthdate and zip (Latanya Sweeney)
https://dataprivacylab.org/projects/identifiability/paper1.pdf
Risiko Test: https://cpg.doc.ic.ac.uk/individual-risk/
Facebook Data Leak
• Personality-App
• Data from 3 Mio users
• Data were anonymized according to Facebook
• Gender, birthdate, and zip ↯ ↯ ↯
• Results from personality tests
• (Additionally, username and password were available to access the data)
Source: https://www.newscientist.com/article/2168713-huge-new-facebook-data-leak-exposed-intimate-details-of-3m-users/
Daimler TSS Data Warehouse / DHBW 73
Fine for false anonymization
Quelle: https://edpb.europa.eu/news/national-news/2019/danish-data-protection-agency-proposes-dkk-12-million-fine-danish-taxi_en
Daimler TSS Data Warehouse / DHBW 74
CIA triad – data classification
Confidentiality
IntegrityAvailability
whether or not
information is kept
secret or private,
e.g. data theft
whether the
information is kept
accurate, e.g.
faking data
ensuring that
information is
available when it is
needed, e.g.
blackout
Daimler TSS Data Warehouse / DHBW 75
Typical security functionalities and measures in databases
Auditing
Encryption
Masking (static, dynamic)
Row Level Security
Authentication & Password
policies
Patching
Secure installation
SSL – secure
communication
Authorization & roles,
profiles, ressource limits
Security checklists
Daimler TSS 77
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Data Warehouse /
DHBWDaimler TSS 79
Data Culture
Data quality and Meta Data management
Domain knowledge
Data culture
Digitization hot spotsBiMa-studie 2018 (BARC + sopra steria consulting)
Daimler TSS Data Warehouse / DHBW 80
Digitization - New giants „BAT“Baidu (Google) + Alibaba (Amazon) + Tencent (Facebook)
Sources:
https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/
https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl
Daimler TSS Data Warehouse / DHBW 81
Stephen Few: Big Data, Big Dupe (Analytics Press, 2018)
• “Taping into the potential of data involves data sensemaking. I prefer the term over the more popular term analytics because it better fits the full range of activities that are needed” (page 44)
• “Data sensemaking requires a significant investment in the development of human skills”(page 74)
• DWH/Big Data Architecture (instead of technology/tools)
• Data integration, Data visualization, Data modeling, …
Daimler TSS Data Warehouse / DHBW 82
Data landscape manifesto@scout24
Source: Data Festival, Munich 2019
Daimler TSS Data Warehouse / DHBW 83
Netflix’ three credos for data culture
Source: Phil Simons’ webinar on “The visual organization: data visualization, big data, and the quest for better decisions” from
Harvard Business Review (2014)
Data catalog: Data should be accessible, easy to discover, and easy to process for everyone
Data storytelling: Whether your dataset is large or small, being able to visualize it makes it easier to explain
Fail fast and iterate: The longer you take to find the data, the less valuable it becomes
Daimler TSS Data Warehouse / DHBW 84
From DevOps to DataOps
A collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization.
The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses
technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.
Source: https://medium.com/data-ops/the-best-dataops-articles-of-q3-2018-c39882be3d7b
Daimler TSS Data Warehouse / DHBW 85
Data Culture
Establish a culture for
• Sharing data across the organization by following privacy and ethics
• Collaborating around data products and data platforms
• Data-driven decisions instead of e.g. experience or best paid employee in the room
• Speed: fail fast and iterate
Being successful as a data-driven company requires the active involvement of all employees
Daimler TSS Data Warehouse / DHBW 86
We’re entering a new world in which
data may be more important than
software.
[Tim O’Reilly, Founder O’Reilly Media]
Data is a precious thing and will last longer than
the systems themselves.
[Tim Berners-Lee, Father of the Worldwide Web]
Information is the oil of the 21st
century
[Peter Sondergaard, Gartner]
Everything we do in the digital realm ... creates a data trail.
And if that trail exists, chances are someone is using it.
[Douglas Rushkoff, Author]
Data creation is exploding
[Gavin Belson, HBOs Silicon Valley]
Data is the new gold
[Open Data Initiative, European Commission]
In a world deluged by irrelevant
information, clarity is power.
[Yuval Noah Harari, Author]
Big data is not about the data
[Gary King, Harvard University]
Data WarehousE
Data Warehouse /
DHBWDaimler TSS 89
Applications come, applications go.
The data, however, lives forever.
It is not about building applications;
it really is about the data underneath these applications
(Tom Kyte, Oracle)
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle
© Daimler TSS I Template Revision
Visual Vocabulary
https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary
http://www.vizwiz.com/2018/07/visual-vocabulary.html
Daimler TSS Data Warehouse / DHBW 91
Learn how to replace code
• Machine learning will no doubt change software development in significant ways
• Software developers will put much more effort into data collection and preparation
• Developers will have to do more than just collect data; they’ll have to build data pipelines and the infrastructure to manage those pipelines. We’ve called this “Data Engineering”
• Data engineers will be responsible for maintaining the data pipeline: ingesting data, cleaning data, feature engineering, and model discovery
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-developmentDaimler TSS Data Warehouse / DHBW 92
Types of metadata (1)
Business Metadata
• Definition of business vocabulary and relationships
• Definition of the value range
• Linkage to physical representation
Daimler TSS Data Warehouse / DHBW 93
Types of metadata (2)
Report and ETL metadata
• Report definitions
• Data sources
• Column definitions
• Computations
Logical and physical metadata of data model
• Table structure
• Definition of columns
• Relationships between tables and columns
• Dimension hierarchy Daimler TSS Data Warehouse / DHBW 94
Benefits of Metadata management
Source: Detlef Apel: Datenqualität erfolgreich steuern, dpunkt 2015, chapter 14
• Data Lineage and dependencies
• Generating and controlling DWH processes
• Improve SW development quality
• Increase comprehensibility of KPIs
Daimler TSS Data Warehouse / DHBW 95
Technical Metadata managementvery often not successful
Metadata Repository
OLTP-1
OLTP-2
Microservice-1Microservice-1
Microservice-1Microservice-1
DWH
Data Lake
Who enriches / tags
technical metadata
with
legal and business
relevant information???
Daimler TSS Data Warehouse / DHBW 96
Data Catalog a hot topic
• New Data Catalog vendors are entering the market
• Established vendors rebrand and enrich their existing toolsDaimler TSS Data Warehouse / DHBW 97
Alation architecture
Not just an RDBMS for structured
metadata, but also storage engines for text
data
Daimler TSS Data Warehouse / DHBW 98
Data scientist sexiest job of the 21st century?
Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-articleDaimler TSS Data Warehouse / DHBW 99
SQL standard is evolving
Source: https://vimeo.com/289497563
Daimler TSS Data Warehouse / DHBW 100
Exercise: compute most recent rows
Write an SQL statement that computes the most recent data for each customer.
Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts
Customer_
key
Name Status Valid_from
1 Brown Single 01-MAY-2014
2 Bush Married 05-JAN-2015
1 Miller Married 15-DEC-2015
3 Stein 15-DEC-2015
3 Stein Single 18-DEC-2015
SIN1.sqlDaimler TSS Data Warehouse / DHBW 101
Exercise: compute most recent rows
Daimler TSS Data Warehouse / DHBW 102
Exercise: compute most recent rows
create table s_customer (customer_key integer NOT NULL
, cust_name varchar2(100) NOT NULL, status varchar2(10), valid_from date NOT NULL, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)
);
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown', 'Single', to_date('01.05.2014', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush', 'Married', to_date('05.01.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller', 'Married', to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein', NULL, to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein', 'Single', to_date('18.12.2015', 'DD.MM.YYYY'));
commit;
Daimler TSS Data Warehouse / DHBW 103
Exercise: compute most recent rowssolution 1: max-function
SELECT s.*
FROM S_CUSTOMER s
JOIN (SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key) b
ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
S2IN.sqlDaimler TSS Data Warehouse / DHBW 104
Exercise: compute most recent rowssolution 2: exists
SELECT s.*
FROM S_CUSTOMER s
WHERE NOT EXISTS (SELECT 1
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from);
S2IN.sqlDaimler TSS Data Warehouse / DHBW 105
Exercise: compute most recent rowssolution 3: max in correlated sub-select
SELECT s.*
FROM S_CUSTOMER s
WHERE s.valid_from = (SELECT MAX(i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key);
S2IN.sqlDaimler TSS Data Warehouse / DHBW 106
Exercise: compute most recent rowssolution 4: coalesce with sub-select
SELECT *
FROM (SELECT coalesce ((SELECT min (i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from
), to_date ('31.12.9999', 'DD.MM.YYYY'))
as end_ts,
s.*
FROM S_CUSTOMER s)
WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');
S2IN.sqlDaimler TSS Data Warehouse / DHBW 107
Exercise: compute most recent rowssolution 5: max-function and with-clause
WITH max_cust as (
SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key)
SELECT s.*
FROM S_CUSTOMER s
JOIN max_cust b ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
S2IN.sqlDaimler TSS Data Warehouse / DHBW 108
partition data
compute functions over these partitions
Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]
return result
Exercise: compute most recent rowsSQL Analytic / Windowing functions
Daimler TSS Data Warehouse / DHBW 109
Exercise: compute most recent rowssolution 6: Analytic / Windowing function lead
WITH lead_cust as (
SELECT lead (s.valid_from, 1) OVER (PARTITION BY
s.customer_key
ORDER BY s.valid_from ASC) as end_ts
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.end_ts IS NULL;
S3IN.sqlDaimler TSS Data Warehouse / DHBW 110
Exercise: compute most recent rowssolution 7: Analytic function row_number
WITH lead_cust as (
SELECT row_number() OVER(PARTITION BY s.customer_key
ORDER BY s.valid_from DESC) as rn
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.rn = 1;
S3IN.sqlDaimler TSS Data Warehouse / DHBW 111
Max or Analytic / Windowing functions Which alternative would you recommend?
• Check execution plans, execution time including service + response time, resource usage for final decision
• Solutions with Analytic / Windowing do not need self-join and show better statistics compared to the other shown solutions
• Analytic / Windowing functions are very powerful
• Remark: Usage of with-clause in SQL statements is preferable compared to sub-selects as it improves readability, understandability, maintainability
Daimler TSS Data Warehouse / DHBW 112
Temporal data storage (Bitemporal data)
Daimler TSS Data Warehouse / DHBW 113
Temporal data storage (Bitemporal data)
10.09. 20.09. 30.09. 10.10.
Time
Price: 15EUR Price: 16EUR
New Price of 16EUR is
entered into the DB
Valid
Time
(20.09.)
Transaction
Time
(10.09.)
Daimler TSS Data Warehouse / DHBW 114
• Time period when a fact is true in the real world
• The end user determines start and end date/time (or just a date/time for events)
Business validity:
Valid time
• Time period when a fact stored in the database is known
• ETL process determines start and end date/time
Technical validity:Transaction time
• Combines both Valid and Transaction TimeBitemporal data
Temporal data storage (Bitemporal data)Definition
Daimler TSS Data Warehouse / DHBW 115
Temporal data storage (Bitemporal data)SQL standard
• SQL standard SQL:2011
• But different implementations by RDBMSes like Oracle, Db2, SQL Server and others
• Different syntax!
• Different coverage of standard!
• Very useful for slowly changing dimensions type 2, but also for other purposes
Daimler TSS Data Warehouse / DHBW 116
Db2 Valid Time example
CREATE TABLE customer_address
( customerID INTEGER NOT NULL
, name VARCHAR(100)
, city VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );
Daimler TSS Data Warehouse / DHBW 117
Db2 Valid Time example
INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');
UPDATE customer_address FOR PORTION OF BUSINESS_TIME
FROM '22.05.2013' TO '31.12.2013'
SET city = 'San Diego' WHERE customerID = 1;
customerID Name City Valid_start Valid_end
1 Miller Seattle 01.01.2013 22.05.2013
1 Miller San Diego 22.05.2013 31.12.2013
Daimler TSS Data Warehouse / DHBW 118
Db2 Valid Time example
SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
Daimler TSS Data Warehouse / DHBW 119
Db2 transaction Time example
CREATE TABLE customer_info(
customerId INTEGER NOT NULL,
comment VARCHAR(1000) NOT NULL,
sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);
Daimler TSS Data Warehouse / DHBW 120
Db2 transaction Time example
Transaction on 15.10.2013:
INSERT INTO customer_info VALUES( 1, 'comment 1');
Transaction on 31.10.2013
UPDATE customer_address SET comment = 'comment 2'
WHERE customerID = 1;
CustomerI
d
comment Sys_start Sys_end
1 Comment
2
31.10.2013 31.12.2999
Daimler TSS Data Warehouse / DHBW 121
Db2 transaction Time example
SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';
Data comes from a history table:
Valid Time and Transaction Time can be combined = Bitemporal table
CustomerId comment Sys_start Sys_end
1 Comment 1 15.10.2013 31.10.2013
Daimler TSS Data Warehouse / DHBW 122
Indexing - why
• Very important performance improvement technique
• Good for many reads with high selectivity, write penalty
• B-trees most common
root
branch branch
leaf leaf leaf
…
…
Table
Daimler TSS Data Warehouse / DHBW 123
Indexing a star schema – which columns are candidates for an index?
• DBs index Primary Keys by default
• Dimension table columns that are regularly used in where clausesare candidates
• Maybe foreign Key columns in Fact table (see also later Star Transformation)
Daimler TSS Data Warehouse / DHBW 124
Star transformation
• Fact table has normally much more rows compared to dimension tables
• Common join techniques would need to join first dimension table with the fact table
• Alternative technique: evaluate all dimensions(cartesian join)
• Then join into fact table in last step
• Oracle uses Bitmap indexes on foreign key columns in fact tables to achieveStar Join; not supported by many DBs
Daimler TSS Data Warehouse / DHBW 125
Daimler TSS Data Warehouse / DHBW 126
Partitioning
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
3 C CC CCC
Col1 Col2
1 A
2 B
3 C
Col3 col4
AA AAA
BB BBB
CC CCC
Col1 Col2 Col3 col4
3 C CC CCC
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
Vertical partitioning Horizontal partitioning
Horizontal partitioning
• Very powerful feature in a DWH to reduce workload
• Split table into logical smaller tables
• Avoidance of full table scans
• How could a table be split?
• Introduction to (Oracle) partitioning: https://asktom.oracle.com/partitioning-for-developers.htm
Daimler TSS Data Warehouse / DHBW 127
Horizontal Partitioning – splitting options
• By range
• Most common
• Use date field like order data to partition table into months, days, etc
• By list
• Use field that has limited number of different values, e.g. split customer data by country if end users most likely select customers from within a country
• By hash
• Use a filed that most likely splits the data in evenly distributed chunks
Daimler TSS Data Warehouse / DHBW 128
Parallelism
• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution to several CPUs
• Powerful combination with partitioning
• Parallelism is limited by the number of CPUs: if parallelism is too high, performance will degrade
• Intra-query parallelism and inter-query parallelism
Daimler TSS Data Warehouse / DHBW 129
Compression
• Data compression + Index compression
• Store more data in a block/page = read more data during I/O
• If CPU resources are available, often a very powerful feature to improve performance
• Additionally reduce storage
• Additionally reduce backup time
Daimler TSS Data Warehouse / DHBW 130
Already covered in a previous lecture
• Relational columnar In-Memory DB
• Materialized Views / Query Tables
Daimler TSS Data Warehouse / DHBW 131
Exercise - recapture ETL and DB specifics
• Recapture ETL and DB specific topics
• Which topics do you remember, or do you find important?
• Write down 1-2 topics on stick-it cards.
Daimler TSS Data Warehouse / DHBW 132
Working with Alation
• Onboarding of data sources
• IT just creates cover and grants access
• Self-service: Done by application owners including source system connection
• No need for central password management
• Scales for onboarding of many systems
Daimler TSS Data Warehouse / DHBW 133
Schema and its tables
Daimler TSS Data Warehouse / DHBW 134
Table and its columns with sample data
Daimler TSS Data Warehouse / DHBW 135
Columns and relationships
Daimler TSS Data Warehouse / DHBW 136
Legal tagsGDPR and other regulations
Associate legal tags
• Articles 16-21
• Identify data
• Right to erasure
• Right to be forgotten
Daimler TSS Data Warehouse / DHBW 137
Central vs local Data Catalogs
Central data Catalog
• Integrated views
• Mammoth task
• No redundancy
Local data Catalogs (reality)
• Legal requirements
• Feasibility
• Tool support very weak
Data
Cata
logSource 1
Source 2
Source 3
Data
Cata
log
Source 1
Source 2
Source 3
Data
Cata
log
Daimler TSS Data Warehouse / DHBW 138
Summaryinformation is beautiful
Source: https://www.youtube.com/watch?v=hOex1iU57iw
Daimler TSS Data Warehouse / DHBW 139
Data access
• End users should only have access to the Data Marts in a DWH
• Grant privileges and roles to users
• Good practice:
• Grant privileges directly to a role; do not nest roles (grant role to role)
• Grant role to users
• Distinguish end users, administrators and technical users with different password policies
• Much more difficult to grant access in a Data Lake
• Tool maturity
• Data often unknown with schema-on-read
Daimler TSS Data Warehouse / DHBW 140
Evaluation criteria
Technical Metadata
Business Metadata incl. Glossary
Tagging (Linkage)
Collective Intelligence(Collaboration)
Search
Security
Source connectors
Data profiling
Data access
Lineage
API
Versioning
Architecture
Components
Prerequisites
LicencingDaimler TSS Data Warehouse / DHBW 141
Over 75%• of time is spent for
• say they least enjoy
DATA PREPARATION
Data Consumers