MELJUN CORTES Fundamentals of Enterprise Data Management Week 02
© 2013 IBM Corporation. IBM Confidential
BAFEDM2: Fundamentals of Enterprise Data Management
Week 02
Agenda
Module 1: Introduction to Data Warehousing (continued)
• Framework of the Data Warehouse
• Data Warehouse Options
Module 1: Introduction to Data Warehousing (continued)
Framework of the Data Warehouse: OLTP, OLAP, ODS

[Figure: Sources 1 to N feed an Extract, Transform, Load (ETL) process; OLTP source data flows into the OLAP data warehouse and the ODS data store, which populate Data Marts 1 to N and reports.]
OLTP: On-Line Transaction Processing
• A system that keeps track of an organization's daily transactions and updates the warehouse at periodic intervals
• Involves frequent inserts, updates, and deletes; highly volatile data; and application-specific data
• A class of systems that facilitate and manage transaction-oriented applications, mainly data entry and retrieval transactions
• Most of the systems used in day-to-day business are of the OLTP type, such as:
– Order entry
– Inventory management
– Railway reservation systems
– Payroll or production tracking
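The OLTP profile described above — many small transactions that insert and update individual records — can be sketched with Python's built-in sqlite3 module. The order-entry table and its values are hypothetical, for illustration only:

```python
import sqlite3

# Hypothetical order-entry table; OLTP workloads are dominated by
# many small, short transactions over individual records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER, status TEXT)"
)

# One business transaction: record an order, then adjust it.
with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO orders (item, qty, status) VALUES (?, ?, ?)",
        ("widget", 3, "NEW"),
    )
    conn.execute("UPDATE orders SET qty = qty + 1 WHERE item = ?", ("widget",))

row = conn.execute("SELECT item, qty, status FROM orders").fetchone()
print(row)  # ('widget', 4, 'NEW')
```

The point-in-time state of such tables is what a periodic ETL run later extracts into the warehouse.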
OLAP: On-Line Analytical Processing
• A technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for further analysis
• Inserts, updates, and deletes are periodic batch processes; data is non-volatile, integrated, and summarized
• Enables end-users to perform ad-hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making
• Typical OLAP applications are:
– Business reporting for sales and management
– Budgeting and forecasting
– Financial reporting
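The multi-dimensional aggregation at the heart of OLAP can be sketched in plain Python. The sales facts and the region/quarter dimensions below are illustrative assumptions, not data from the slides:

```python
from collections import defaultdict

# Hypothetical sales facts; an OLAP view aggregates a measure (sales)
# along dimensions (region, quarter) rather than reading single rows.
facts = [
    {"region": "North", "quarter": "Q1", "sales": 100},
    {"region": "North", "quarter": "Q2", "sales": 150},
    {"region": "South", "quarter": "Q1", "sales": 80},
    {"region": "South", "quarter": "Q1", "sales": 20},
]

# Roll up the sales measure over the (region, quarter) dimension pair.
cube = defaultdict(int)
for f in facts:
    cube[(f["region"], f["quarter"])] += f["sales"]
print(cube[("South", "Q1")])  # 100

# A further roll-up to region alone summarizes one more level.
by_region = defaultdict(int)
for (region, _quarter), total in cube.items():
    by_region[region] += total
print(by_region["North"])  # 250
```

Drilling down is the reverse: moving from `by_region` back to the finer-grained `cube` entries.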
ODS: Operational Data Store
• A subject-oriented, integrated, volatile, current-valued data store containing only detailed corporate data
• An ODS typically:
– is meant only for operational systems
– contains current and near-current values
– contains detailed data
– is meant for day-to-day decisions and operational activities
Characteristic    | OLTP                            | OLAP                                    | ODS
Used For          | Day-to-day transactions         | Information management in an enterprise | Operational activities and day-to-day decisions
Database Size     | Moderate                        | Very large                              | Moderate
Data Load         | Field by field                  | Batch upload                            | Field by field
Accessed By       | Operational users               | Executives, managers, and analysts      | Analysts and operational users
Kind of Data      | Individual records              | Sets of records                         | Individual records
Type of Data      | Transaction                     | Analysis                                | Transaction and analysis
Methodology       | Operational requirements        | Evolutionary                            | Data driven
Data Structure    | Detailed                        | Highly summarized                       | Detailed and lightly summarized
Data Organization | Functional                      | Subject-oriented                        | Subject-oriented
Data Source       | Homogenous, application-centric | Heterogenous                            | Homogenous
Data Redundancy   | Not redundant                   | Managed redundancy                      | Redundant to some extent
Framework of the Data Warehouse: Architecture
[Figure: end-to-end data warehouse architecture in four tiers. Source systems (Legacy, Flat Files, Web, ERP, CRM, SCM) carry source metadata into a landing area. Staging runs Extract, Transform, Load (Profile, Extract, Analyze, Cleanse, Transform, Integrate/Consolidate, Quality, Load) with Change Data Capture, producing ETL metadata. The data stores tier (Operational Data Store, Data Warehouse, Data Marts) carries technical metadata. The analytics tier (Information on Demand, OLAP, Data Mining, Reporting, Action, Analysis) carries business metadata.]
Source Systems
The source systems of a data warehouse can be legacy data sources, ERPs, simple flat files, complex SAP sources, COBOL sources, or other data sources such as RDBMSs, AS/400, or web application data. Commonly, these operational (OLTP) data sources are known as transactional data sources.
• Gets input from: individual live applications' data
• Tasks done here: application-specific, transactional, point-in-time data load
• Sends output to: either the landing area or the staging area
Landing Area
The landing area is a volatile intermediate area for operational data before transformation takes place. It is implemented to insulate the OLTP systems from the developer, to avoid load from direct access on online source system applications, and, in some cases, to abide by federal laws. Source system data is either pushed or pulled into the landing area in a pre-determined format from the respective source systems. This data is then loaded into the staging area. In some scenarios, data can be sourced directly from the source systems to the staging area instead of routing it through the landing area.
• Gets input from: individual live applications' data in a pre-determined format
• Tasks done here: application-specific, transactional, point-in-time data load
• Sends output to: staging area
Staging Area
The staging area is a place where you hold temporary tables on the data warehouse server. We basically need a staging area to hold the data and perform data cleansing and merging before loading the data into the warehouse. Sometimes, the staging area is also required to hold a subset of the source data for data profiling activities.

Data quality (information quality) is defined as standardizing and consolidating customer and/or business data. By cleansing, enhancing, merging, and scrubbing the data, and by combining/aggregating related records to avoid duplicate entries, you are able to create a single-record view. The staging area can also hold reference and standardization tables.
• Gets input from: landing area or individual source systems
• Tasks done here: extraction, cleansing, transformation, integration, and standardization of disparate source system data to generate a complete and conformed record
• Sends output to: volatile, integrated, point-in-time data moved to either the operational data store, the data warehouse, or data marts
Change Data Capture
Change Data Capture (CDC) is a set of software design patterns used to determine the data that has changed in a database so that action can be taken on the changed data. CDC solutions occur mostly in data warehouse environments, since capturing and preserving the state of data across time is one of the core functions of a data warehouse. CDC can take place in the source, landing, or staging area.
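One of the simplest CDC patterns, snapshot comparison, can be sketched as follows. Real CDC tools more often read database logs or journals; the keyed records here are hypothetical:

```python
# Snapshot-comparison CDC: diff two point-in-time copies of a table
# (represented as dicts keyed by primary key) to find the inserts,
# updates, and deletes that occurred between captures.
def capture_changes(previous, current):
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted

old_snapshot = {1: "open", 2: "shipped", 3: "open"}
new_snapshot = {1: "open", 2: "closed", 4: "open"}

ins, upd, dele = capture_changes(old_snapshot, new_snapshot)
print(ins, upd, dele)  # {4: 'open'} {2: 'closed'} {3: 'open'}
```

The three result sets drive the corresponding inserts, updates, and deletes on the target data store.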
Extract, Transform, Load
Extract: extracts data from either the landing area or directly from source systems, preferably using ETL tools, or else using custom scripts.
Transform: transformation involves the following:
1. Analyze the data
2. Profile the data (optional, required for data quality)
3. Cleanse the data
4. Integrate the data
5. Standardize the data
6. Data quality
Load: loads the integrated, complete, and conformed system of record into either the operational data store, the data warehouse, or data marts.
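The three phases can be sketched end to end as a toy pipeline. The raw rows, cleansing rules, and in-memory "warehouse" target are all illustrative assumptions:

```python
# A minimal ETL sketch: extract raw CSV-like rows, transform them
# (cleanse and standardize), and load them into a list standing in
# for a warehouse table.
raw_rows = ["alice , NY , 100", "BOB,ny,   200", "carol,CA,"]

def extract(rows):
    # Extract: parse each raw row into fields.
    return [r.split(",") for r in rows]

def transform(records):
    # Transform: cleanse whitespace, standardize case, default missing values.
    out = []
    for name, state, amount in records:
        out.append({
            "name": name.strip().title(),
            "state": state.strip().upper(),
            "amount": int(amount) if amount.strip() else 0,
        })
    return out

warehouse = []  # stands in for the target data store

def load(records, target):
    # Load: append the conformed records to the target.
    target.extend(records)

load(transform(extract(raw_rows)), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'state': 'NY', 'amount': 100}
```

A real pipeline would add the profiling, integration, and quality steps listed above between extract and load.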
Change Data Capture
• Gets input from: source systems' databases, the ETL metadata repository, or log or journal entries in databases on the source, staging, or landing areas
• Tasks done here: identification of records that have been inserted (new), updated (modified), or deleted (removed)
• Sends output to: target systems, where the necessary inserts, updates, and deletes are performed
Extract, Transform, Load
• Gets input from: landing area, individual source systems, or staging area
• Tasks done here: extraction, cleansing, transformation, integration, and standardization of disparate source system data to generate a complete and conformed record; generates ETL metadata
• Sends output to: the integrated, complete, and conformed record moved to either the operational data store, the data warehouse, or data marts
ETL: Transform
1. Analyze the Data
2. Profile the Data
3. Cleanse the Data
4. Integrate the Data
5. Standardize the Data
6. Data Quality
Analyze the Data
This involves analysis of metadata and data values, and detection of differences between defined and inferred properties.
Profile the Data
Data profiling is a process to assess current data conditions, or to monitor data quality over time. It begins with collecting measurements about the data, then looking at the results individually and in various combinations to see where anomalies exist.
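Collecting such measurements can be sketched with the standard library. The column values and the particular statistics chosen are illustrative:

```python
from collections import Counter

# Profile a hypothetical state-code column: null counts, distinct values,
# maximum length, and top values are the kinds of measurements examined
# for anomalies (here, casing and format inconsistencies stand out).
column = ["NY", "CA", "ny", None, "TX", None, "CALIFORNIA"]

non_null = [v for v in column if v is not None]
measurements = {
    "count": len(column),
    "nulls": sum(1 for v in column if v is None),
    "distinct": len(set(non_null)),
    "max_length": max(len(v) for v in non_null),
    "top_values": Counter(non_null).most_common(2),
}
print(measurements["nulls"], measurements["max_length"])  # 2 10
```

A max length of 10 in a column expected to hold two-letter codes is exactly the kind of anomaly profiling is meant to surface.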
Cleanse the Data
Data cleansing (also referred to as data scrubbing) is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. The term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
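A minimal cleansing pass over hypothetical records — dropping incomplete rows, correcting a known misspelling via a lookup table, and removing exact duplicates — might look like this:

```python
# Cleansing sketch under illustrative rules; the records and the
# corrections lookup table are hypothetical.
records = [
    {"id": 1, "city": "New Yrok"},   # misspelled value
    {"id": 2, "city": ""},           # incomplete record
    {"id": 3, "city": "Boston"},
    {"id": 3, "city": "Boston"},     # exact duplicate
]

corrections = {"New Yrok": "New York"}  # hypothetical lookup table

seen, clean = set(), []
for r in records:
    if not r["city"]:                 # delete incomplete records
        continue
    # Replace known-bad values using the lookup table.
    r = {**r, "city": corrections.get(r["city"], r["city"])}
    key = (r["id"], r["city"])
    if key in seen:                   # remove exact duplicates
        continue
    seen.add(key)
    clean.append(r)

print(clean)  # [{'id': 1, 'city': 'New York'}, {'id': 3, 'city': 'Boston'}]
```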
Integrate the Data
This involves integration and consolidation of data from various source systems to form a single system of record. Essentially, understanding the complete lifecycle of a product means integrating the different records for its different processes into a single system of record.
Standardize the Data
Data standardization transforms different input formats into a consolidated output format. It helps in creating single-domain fields and in incorporating business and industry standards.
Data Quality
Without accurate data, users lose confidence in the data and make improper decisions. Data quality addresses issues like:
• Business rule violations, e.g., missing data; use of defaults (1, 0, 9999, or ?); or data with embedded logic (e.g., item codes start with 1, product codes start with 9)
• Data integrity violations, e.g., duplicate primary keys, one entity with different key identifiers, missing reference data, or multiple variations of the same value
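Checks for both classes of violation can be sketched as simple rule scans. The sentinel value 9999 and the sample rows are illustrative assumptions:

```python
# Data-quality rule scan: flag business-rule violations (sentinel
# default prices) and integrity violations (duplicate primary keys).
rows = [
    {"pk": 1, "item_code": "1001", "price": 9999},  # sentinel default price
    {"pk": 2, "item_code": "9004", "price": 25},
    {"pk": 2, "item_code": "1002", "price": 30},    # duplicate primary key
]

violations = []
seen_pks = set()
for row in rows:
    if row["price"] in (0, 9999):          # business rule: suspicious default
        violations.append(("default_value", row["pk"]))
    if row["pk"] in seen_pks:              # integrity rule: duplicate key
        violations.append(("duplicate_pk", row["pk"]))
    seen_pks.add(row["pk"])

print(violations)  # [('default_value', 1), ('duplicate_pk', 2)]
```

In practice such rules are catalogued and run continuously, with violations routed back for cleansing rather than merely printed.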
Operational Data Store
An operational data store is a subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information.
• Gets input from: staging area
• Tasks done here: data storage for the current period; alters key structures, reformats data, lightly summarizes data, recalculates data; queried for analysis
• Sends output to: data warehouse and/or data marts

Audience: Data analysts
Data Model: Entity-Relationship (normalized), detailed and lightly summarized
Database Size: Moderate
Data Update: Field by field
Philosophy: Support day-to-day decisions and operational activities
Data Warehouse
A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs.
• Gets input from: staging area or operational data stores
• Tasks done here: data storage for the historical period; alters key structures, reformats data, summarizes data, recalculates data
• Sends output to: data marts

Audience: Managers and analysts
Data Model: Dimensional and summarized
Database Size: Large to very large
Data Update: Batch, controlled
Philosophy: Support managing the enterprise
Data Mart
A data mart is a body of decision-support data for a department that has an architectural foundation of a data warehouse; it can also represent a business process that can proliferate across many departments.
• Gets input from: staging area, operational data stores, or the data warehouse
• Tasks done here: summarization, key allocation, aggregation, de-normalization
• Sends output to: analytics and business intelligence tools, which use this data for reporting and data mining

Audience: Executives, managers, and analysts
Data Model: Dimensional and summarized
Database Size: Moderate to large
Data Update: Batch, controlled
Philosophy: Operational efficiency
Analytics
Analytics is defined as the extensive use of data, statistical and quantitative analyses, explanatory and predictive modeling, and fact-based management to drive decision making.
There are three types of analytics:
• Descriptive analytics provides information about the past state or performance of a business and its environment. It provides regular reports for events that have already happened and ad hoc reports to help examine what happened, where, how often, and with how many.
• Predictive analytics helps predict, based on data and statistical techniques, what will happen next so that you can make well-informed decisions and improve business outcomes. It uses simulation models to suggest what could happen.
• Prescriptive analytics recommends high-value alternative actions or decisions given a complex set of targets, limits, and choices. It predicts future outcomes and suggests courses of action to take so that you can benefit from those predictions.
Metadata
“Two contractors are assigned the task of building a bridge. One is to start building from the east end and the other from the west end. Both have to meet in the center and then merge. When they arrived at the center point, one end of the bridge was higher than the other by a few inches. This was because one group of contractors and their engineers used kilograms and meters, while the other used pounds and feet. It caused the parent company losses in the billions. The reason: it wasn’t the data that was faulty; it was the metadata.”
Metadata is “data about data.” It refers to data that describes a data set in terms of its value, content, quality, and significance. It provides insight into the data, answering questions like:
1. What kind of data is it?
2. Who is the owner of this data?
3. How was the data created?
4. What are the attributes and significance of the data created or collected?
[Figure: four metadata layers span the architecture: Source Metadata, ETL Metadata, Technical Metadata, and Business Metadata.]
Data Warehouse Options
There are, perhaps, as many ways to develop data warehouses as there are organizations. Moreover, there are a number of key factors that need to be considered: scope, data redundancy, and type of end-user.
• Scope. The scope of a data warehouse may be as broad as all the informational data for the entire enterprise from the beginning of time, or as narrow as a personal data warehouse for a single manager for a single year.
• Data Redundancy. Virtual data warehouses allow end users to access operational databases directly; they provide the ultimate in flexibility as well as the minimum amount of redundant data that must be loaded and maintained. Central data warehouses are single physical databases that contain all the data for a specific functional area, department, division, or enterprise. Distributed data warehouses are those in which certain components are distributed across a number of different physical databases.
• Type of End-User. End-users can be broadly categorized into three groups: executives and managers; power users (business and financial analysts, engineers); and support users (clerical, administrative).
For the Next Session
BAFEDM2: Fundamentals of Enterprise Data Management
Agenda
• Module 2: Data Warehouse Design Considerations
– Data Models
– The Dimensional Model
– Facts and Dimensions
– Four-Step Dimensional Design Process
– Case Study: Retail