MELJUN CORTES Fundamentals of Enterprise Data Management Week 02

Page 1

© 2013 IBM Corporation | IBM Confidential

BAFEDM2: Fundamentals of Enterprise Data Management

Week 02

Page 2

Agenda

Module 1: Introduction to Data Warehousing (continued)
• Framework of the Data Warehouse
• Data Warehouse Options

Page 3

Module 1: Introduction to Data Warehousing (continued)

BAFEDM2: Fundamentals of Enterprise Data Management

Page 4

Framework of the Data Warehouse: OLTP, OLAP, ODS

[Diagram: OLTP source systems (Source 1 … Source N) feed an Extract, Transform, Load (ETL) process; ETL populates the OLAP data warehouse and the ODS data store; the data warehouse feeds Data Mart 1 … Data Mart N, and the data stores and marts feed reports.]

Page 5

Framework of the Data Warehouse: OLTP, OLAP, ODS (continued)

OLTP: On-Line Transaction Processing
• A system that keeps track of an organization's daily transactions and updates the warehouse at periodic intervals
• Involves frequent inserts, updates, and deletes; highly volatile data; and application-specific data
• A class of systems that facilitate and manage transaction-oriented applications, mainly data entry and retrieval transactions
• Most of the systems used in day-to-day business are of the OLTP type, such as:
  – Order entry
  – Inventory management
  – Railway reservation systems
  – Payroll or production tracking
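Since the slides describe OLTP only in prose, here is a minimal, hypothetical sketch of an order-entry transaction using Python's built-in sqlite3 module; the tables and column names are invented for illustration. It shows the small, frequent, atomic inserts and updates that characterize OLTP work.

```python
import sqlite3

# Hypothetical order-entry schema: OLTP work is many small, frequent
# inserts/updates against current, application-specific data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, qty INTEGER, status TEXT)")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, on_hand INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('WIDGET', 100)")

# One business transaction: record the order and decrement stock atomically.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO orders (customer, qty, status) VALUES (?, ?, ?)",
                 ("ACME Corp", 5, "NEW"))
    conn.execute("UPDATE inventory SET on_hand = on_hand - ? WHERE item = ?", (5, "WIDGET"))

print(conn.execute("SELECT * FROM orders").fetchall())
print(conn.execute("SELECT * FROM inventory").fetchall())
```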

Page 6

Framework of the Data Warehouse: OLTP, OLAP, ODS (continued)

OLAP: On-Line Analytical Processing
• A technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for further analysis
• Inserts, updates, and deletes are periodic batch processes; the data is non-volatile, integrated, and summarized
• Enables end users to perform ad hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making
• Typical OLAP applications are:
  – Business reporting for sales and management
  – Budgeting and forecasting
  – Financial reporting
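By contrast, an OLAP query summarizes facts across dimensions. A small hypothetical sketch using pandas (the data, dimensions, and measure names are assumptions, not from the slides):

```python
import pandas as pd

# Hypothetical sales facts already loaded into a warehouse-style table.
sales = pd.DataFrame({
    "year":    [2012, 2012, 2013, 2013, 2013],
    "region":  ["East", "West", "East", "West", "West"],
    "product": ["A", "A", "B", "A", "B"],
    "revenue": [100.0, 150.0, 120.0, 180.0, 90.0],
})

# A multi-dimensional, summarized view: total revenue by year x region,
# the kind of aggregate an OLAP "slice and dice" query returns.
cube = pd.pivot_table(sales, values="revenue", index="year",
                      columns="region", aggfunc="sum", margins=True)
print(cube)
```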

Page 7

Framework of the Data Warehouse: OLTP, OLAP, ODS (continued)

ODS: Operational Data Store
• A subject-oriented, integrated, volatile, current-valued data store containing only detailed corporate data
• An ODS typically:
  – is meant only for operational systems
  – contains current and near-current values
  – contains detailed data
  – is meant for day-to-day decisions and operational activities

Page 8

Framework of the Data Warehouse: OLTP, OLAP, ODS (continued)

Characteristic     | OLTP                              | OLAP                                     | ODS
-------------------+-----------------------------------+------------------------------------------+-------------------------------------------------
Used For           | Day-to-day transactions           | Information management in an enterprise  | Operational activities and day-to-day decisions
Database Size      | Moderate                          | Very large                               | Moderate
Data Load          | Field by field                    | Batch upload                             | Field by field
Accessed By        | Operational users                 | Executives, managers, and analysts       | Analysts and operational users
Kind of Data       | Individual records                | Set of records                           | Individual records
Type of Data       | Transaction                       | Analysis                                 | Transaction and analysis
Methodology        | Operational requirements          | Evolutionary                             | Data driven
Data Structure     | Detailed                          | Highly summarized                        | Detailed and lightly summarized
Data Organization  | Functional                        | Subject-oriented                         | Subject-oriented
Data Source        | Homogeneous, application-centric  | Heterogeneous                            | Homogeneous
Data Redundancy    | Not redundant                     | Managed redundancy                       | Redundant to some extent

Page 9

Framework of the Data Warehouse: Architecture

[Diagram: end-to-end architecture in four zones.
SOURCE SYSTEMS (Legacy, Flat Files, Web, ERP, CRM, SCM) feed a LANDING AREA, described by Source Metadata.
STAGING applies Extract, Transform, Load (ETL) steps (Profile, Extract, Analyze, Cleanse, Transform, Integrate/Consolidate, Quality, Load) plus Change Data Capture, described by ETL Metadata.
DATA STORES hold the Operational Data Store, Data Warehouse, and Data Marts, described by Technical Metadata.
ANALYTICS delivers Information on Demand, OLAP, Data Mining, Reporting, Analysis, and Action, described by Business Metadata.]

Page 10

Framework of the Data Warehouse: Architecture (continued)

Source Systems
The source systems of a data warehouse can be legacy data sources, ERPs, simple flat files, complex SAP sources, COBOL sources, or other data sources such as an RDBMS, AS/400, or web application data. Commonly, these sources of operational (OLTP) data are known as transactional data sources.
• Gets input from: each live application's own data
• Tasks done here: application-specific, transactional, point-in-time data load
• Sends output to: either the landing area or the staging area

Landing Area
The landing area is a volatile intermediate area that holds operational data before transformation takes place. It is implemented to insulate the OLTP systems from developers, to avoid placing extra load on the online source applications, and, in some cases, to abide by federal laws. Source system data is either pushed or pulled into the landing area in a pre-determined format, and this data is then loaded into the staging area. In some scenarios, data can be sourced directly from the source systems into the staging area instead of being routed through the landing area.
• Gets input from: each live application's data, in a pre-determined format
• Tasks done here: application-specific, transactional, point-in-time data load
• Sends output to: staging area

[Diagram detail: Source Systems (Legacy, Flat Files, Web, ERP, CRM, SCM), Landing Area, and Source Metadata.]
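A hypothetical sketch of the landing step described above: a point-in-time extract from one source application is pushed to the landing area in a pre-determined file format. The database, table, and path names are assumptions for illustration only.

```python
import csv
import os
import sqlite3
from datetime import date

os.makedirs("landing", exist_ok=True)

# Hypothetical source application database.
src = sqlite3.connect("order_entry.db")
rows = src.execute("SELECT order_id, customer, qty, status FROM orders")

# Point-in-time, application-specific load into the landing area,
# written in the pre-determined format agreed with the source system.
landing_file = f"landing/orders_{date.today():%Y%m%d}.csv"
with open(landing_file, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["order_id", "customer", "qty", "status"])  # agreed header layout
    writer.writerows(rows)
```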

Page 11

Framework of the Data Warehouse: Architecture (continued)

Staging Area
The staging area is where temporary tables are held on the data warehouse server. A staging area is needed to hold the data and perform data cleansing and merging before loading the data into the warehouse. Sometimes the staging area is also required to hold a subset of the source data for data profiling activities.
• Gets input from: landing area or individual source systems
• Tasks done here: extraction, cleansing, transformation, integration, and standardization of disparate source system data to generate a complete and conformed record
• Sends output to: volatile, integrated, point-in-time data moved to either the operational data store, the data warehouse, or the data marts

Data quality (information quality) is defined as standardizing and consolidating customer and/or business data. By cleansing, enhancing, merging, and scrubbing the data, and by combining and aggregating related records to avoid duplicate entries, you are able to create a single-record view. The staging area can also hold reference and standardization tables.

[Diagram detail: Staging area with ETL steps (Profile, Extract, Analyze, Cleanse, Transform, Integrate/Consolidate, Quality, Load), Change Data Capture, and ETL Metadata.]
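As a rough illustration of the cleansing and merging done in staging, the sketch below conforms and deduplicates records from two landed extracts to build a single-record view. The file names, column names, and the customer_id business key are assumptions, not the course's actual tooling.

```python
import pandas as pd

# Two landed extracts from different source applications (hypothetical files).
crm = pd.read_csv("landing/crm_customers_20130101.csv")
erp = pd.read_csv("landing/erp_customers_20130101.csv")

# Conform the disparate source layouts to one staging schema.
crm = crm.rename(columns={"cust_name": "customer_name", "st": "state"})
erp = erp.rename(columns={"name": "customer_name", "region": "state"})
staged = pd.concat([crm, erp], ignore_index=True)

# Basic cleansing and merging: standardize text, drop exact duplicates,
# then keep one record per business key (assumed: customer_id).
staged["customer_name"] = staged["customer_name"].str.strip().str.upper()
staged["state"] = staged["state"].str.strip().str.upper()
staged = staged.drop_duplicates()
staged = staged.drop_duplicates(subset=["customer_id"], keep="last")

staged.to_csv("staging/customers_conformed.csv", index=False)
```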

Page 12

Framework of the Data Warehouse: Architecture (continued)

Change Data Capture
Change Data Capture (CDC) is a set of software design patterns used to determine the data that has changed in a database so that action can be taken on the changed data. CDC solutions occur mostly in data warehouse environments, since capturing and preserving the state of data across time is one of the core functions of a data warehouse. CDC can take place in the source, landing, or staging area.

Extract, Transform, Load
Extract: extracts data from either the landing area or directly from the source systems, preferably using ETL tools, or else using custom scripts.
Transform: transformation involves the following:
1. Analyze the data
2. Profile the data (optional; required for data quality)
3. Cleanse the data
4. Integrate the data
5. Standardize the data
6. Data quality
Load: loads the integrated, complete, and conformed system of record into either the operational data store, the data warehouse, or the data marts.
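One simple way to realize CDC when the source offers no log or journal access is snapshot comparison. The sketch below, with assumed file names and an assumed order_id key, classifies rows as inserted, updated, or deleted between two landed extracts; it is illustrative only, not the pattern the slides mandate.

```python
import pandas as pd

# Yesterday's and today's full extracts of the same source table (assumed files).
old = pd.read_csv("landing/orders_20130101.csv").set_index("order_id")
new = pd.read_csv("landing/orders_20130102.csv").set_index("order_id")

inserted = new.loc[new.index.difference(old.index)]   # keys only in the new snapshot
deleted  = old.loc[old.index.difference(new.index)]   # keys that disappeared
common   = new.index.intersection(old.index)
changed  = (new.loc[common] != old.loc[common]).any(axis=1)
updated  = new.loc[common][changed]                   # same key, different values

print(len(inserted), "inserts,", len(updated), "updates,", len(deleted), "deletes")
```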

Page 13

Framework of the Data Warehouse: Architecture (continued)

Change Data Capture
• Gets input from: the source systems' databases, the ETL metadata repository, or log or journal entries in databases in the source, staging, or landing area
• Tasks done here: identification of records that have been inserted (new), updated (modified), or deleted (removed)
• Sends output to: the necessary inserts, updates, and deletes performed on the target systems

Extract, Transform, Load
• Gets input from: landing area, individual source systems, or staging area
• Tasks done here: extraction, cleansing, transformation, integration, and standardization of disparate source system data to generate a complete and conformed record; generates ETL metadata
• Sends output to: the integrated, complete, and conformed record moved to either the operational data store, the data warehouse, or the data marts

Page 14

Framework of the Data Warehouse: Architecture (continued)

ETL: Transform
1. Analyze the data
2. Profile the data
3. Cleanse the data
4. Integrate the data
5. Standardize the data
6. Data quality

Analyze the Data
This involves analysis of metadata and data values, and detection of differences between defined and inferred properties.

Profile the Data
Data profiling is a process for assessing current data conditions, or for monitoring data quality over time. It begins with collecting measurements about the data, and then looking at the results individually and in various combinations to see where anomalies exist.

Cleanse the Data
Data cleansing (also referred to as data scrubbing) is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. The term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
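The sketch below shows what profiling and a first cleansing pass might look like in practice; the file name, columns, and correction rules are illustrative assumptions rather than anything prescribed by the slides.

```python
import pandas as pd

staged = pd.read_csv("staging/customers_conformed.csv")

# Profile the data: collect measurements and look for anomalies
# (differences between defined and inferred properties).
profile = pd.DataFrame({
    "dtype":      staged.dtypes.astype(str),
    "null_count": staged.isna().sum(),
    "distinct":   staged.nunique(),
})
print(profile)
print(staged.describe(include="all"))

# Cleanse the data: correct or remove records flagged as dirty during profiling.
staged["state"] = staged["state"].replace({"CALIF": "CA", "N.Y.": "NY"})  # inferred corrections
staged = staged.dropna(subset=["customer_id"])                            # incomplete records
staged = staged.drop_duplicates(subset=["customer_id"], keep="last")      # duplicate keys
```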

Page 15

Framework of the Data Warehouse: Architecture (continued)

ETL: Transform
1. Analyze the data
2. Profile the data
3. Cleanse the data
4. Integrate the data
5. Standardize the data
6. Data quality

Integrate the Data
This involves integration and consolidation of data from various source systems to form a single system of record. Essentially, understanding the complete lifecycle of a product means integrating the different records for its different processes into a single system of record.

Standardize the Data
Data standardization transforms different input formats into a consolidated output format. It helps in creating single-domain fields and in incorporating business and industry standards.

Data Quality
Without accurate data, users lose confidence in the data and make improper decisions. Data quality addresses issues such as:
• Business rule violations, e.g., missing data; use of default values (1, 0, 9999, or ?); data with embedded logic (e.g., item codes start with 1, product codes start with 9)
• Data integrity violations, e.g., duplicate primary keys; one entity having different key identifiers; missing reference data; multiple variations of the same value
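A minimal sketch of automated checks for the kinds of data quality issues listed above; the rules and column names are assumptions chosen to mirror the bullet points, not a standard rule set.

```python
import pandas as pd

record_set = pd.read_csv("staging/customers_conformed.csv")

checks = {
    # Business rule violations: missing data and suspicious default values.
    "missing customer name": int(record_set["customer_name"].isna().sum()),
    "default placeholder ids": int(record_set["customer_id"].isin([0, 9999]).sum()),
    # Data integrity violations: duplicate keys and multiple variations of the same value.
    "duplicate primary keys": int(record_set["customer_id"].duplicated().sum()),
    "state value variations": int(record_set["state"].nunique()
                                  - record_set["state"].str.strip().str.upper().nunique()),
}
for rule, violations in checks.items():
    print(f"{rule}: {violations} violation(s)")
```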

Page 16

Framework of the Data Warehouse: Architecture (continued)

Operational Data Store
A subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information.
• Gets input from: staging area
• Tasks done here: data storage for the current period; alters key structures, reformats data, lightly summarizes data, recalculates data; queried by analysts
• Sends output to: data warehouse and/or data marts

Audience: Data analysts
Data Model: Entity-relationship (normalized), detailed and lightly summarized
Database Size: Moderate
Data Update: Field by field
Philosophy: Support day-to-day decisions and operational activities

[Diagram detail: Data Stores (Operational Data Store, Data Warehouse, Data Marts) and Technical Metadata.]

Page 17

Framework of the Data Warehouse: Architecture (continued)

Data Warehouse
A subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs.
• Gets input from: staging area or operational data stores
• Tasks done here: data storage for the historical period; alters key structures, reformats data, summarizes data, recalculates data
• Sends output to: data marts

Audience: Managers and analysts
Data Model: Dimensional and summarized
Database Size: Large to very large
Data Update: Batch, controlled
Philosophy: Support managing the enterprise

Page 18

Framework of the Data Warehouse: Architecture (continued)

Data Mart
A body of decision-support data for a department that has an architectural foundation of a data warehouse; it can also represent a business process that can proliferate across many departments.
• Gets input from: staging area, operational data stores, or the data warehouse
• Tasks done here: summarization, key allocation, aggregation, de-normalization
• Sends output to: analytics and business intelligence tools, which use this data for reporting and data mining

Audience: Executives, managers, and analysts
Data Model: Dimensional and summarized
Database Size: Moderate to large
Data Update: Batch, controlled
Philosophy: Operational efficiency
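To illustrate the dimensional, summarized model used by the warehouse and data marts, here is a small sketch of a star-schema style aggregation; the fact and dimension tables and their column names are invented for the example.

```python
import pandas as pd

# Fact table: one row per sale, keyed to dimension tables.
fact_sales = pd.DataFrame({
    "date_key":    [20130105, 20130105, 20130212],
    "product_key": [1, 2, 1],
    "store_key":   [10, 10, 20],
    "revenue":     [250.0, 80.0, 400.0],
})
dim_product = pd.DataFrame({"product_key": [1, 2], "category": ["Hardware", "Software"]})
dim_store   = pd.DataFrame({"store_key": [10, 20], "region": ["East", "West"]})

# Data-mart style summarization: join facts to dimensions, then aggregate.
mart = (fact_sales
        .merge(dim_product, on="product_key")
        .merge(dim_store, on="store_key")
        .groupby(["region", "category"], as_index=False)["revenue"].sum())
print(mart)
```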

Page 19

Framework of the Data Warehouse: Architecture (continued)

Analytics
Analytics is defined as the extensive use of data, statistical and quantitative analyses, explanatory and predictive modeling, and fact-based management to drive decision making.

There are three types of analytics:
• Descriptive analytics provides information about the past state or performance of a business and its environment. It provides regular reports on events that have already happened, and ad hoc reports to help examine facts about what happened, where, how often, and with how many.
• Predictive analytics helps predict, based on data and statistical techniques, what will happen next so that you can make well-informed decisions and improve business outcomes. It uses simulation models to suggest what could happen.
• Prescriptive analytics recommends high-value alternative actions or decisions given a complex set of targets, limits, and choices. It predicts future outcomes and suggests courses of action to take so that you can benefit from those predictions.

[Diagram detail: Analytics (Information on Demand, OLAP, Data Mining, Reporting, Analysis, Action) and Business Metadata.]
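A toy sketch contrasting descriptive and predictive analytics on warehouse data; the revenue figures and the simple linear-trend "model" are illustrative assumptions, standing in for the richer statistical techniques the slide mentions.

```python
import numpy as np

# Quarterly revenue pulled from a data mart (hypothetical figures, in $K).
quarters = np.arange(1, 9)
revenue = np.array([410, 435, 460, 455, 490, 515, 530, 560], dtype=float)

# Descriptive analytics: report what already happened.
print("total:", revenue.sum(), " mean:", revenue.mean(), " best quarter:", revenue.argmax() + 1)

# Predictive analytics (greatly simplified): fit a linear trend, project next quarter.
slope, intercept = np.polyfit(quarters, revenue, deg=1)
forecast_q9 = slope * 9 + intercept
print(f"projected Q9 revenue: {forecast_q9:.1f}")
```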

Page 20

Framework of the Data Warehouse: Architecture (continued)

Metadata
"Two contractors are assigned the task of building a bridge. One is to start building from the east end and the other from the west end. Both have to meet in the center and then merge. When they arrived at the center point, one end of the bridge was higher than the other by a few inches. This was because one group of contractors and their engineers used kilograms and meters, while the other used pounds and feet. It caused the parent company losses in the billions. Reason: it wasn't the data that was faulty; it was the metadata."

Metadata is "data about data." It refers to data that describes a data set in terms of its value, content, quality, and significance. It provides insight into the data by answering questions such as:
1. What kind of data is it?
2. Who is the owner of this data?
3. How was the data created?
4. What are the attributes and significance of the data created or collected?

Metadata in the architecture: Source Metadata, ETL Metadata, Technical Metadata, Business Metadata
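As a tiny illustration, a metadata record for one warehouse table might capture answers to the four questions above; the fields and values here are assumptions, not a standard metadata schema.

```python
# Illustrative technical/business metadata for one hypothetical warehouse table.
orders_metadata = {
    "table": "dw.fact_orders",
    "kind_of_data": "transactional order facts, one row per order line",
    "owner": "Sales Operations",
    "created_by": "nightly ETL job etl_orders_load",  # how the data was created
    "source_systems": ["order_entry", "crm"],
    "refresh": "daily batch, controlled load",
    "attributes": {
        "order_id": "surrogate key",
        "revenue": "USD, net of discounts",
        "order_date": "date dimension key (YYYYMMDD)",
    },
}
print(orders_metadata["owner"])
```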

Page 21

Framework of the Data Warehouse: Architecture (continued)

[Diagram: the full end-to-end architecture from page 9, repeated as a summary: Source Systems → Landing Area → Staging (ETL and Change Data Capture) → Data Stores (Operational Data Store, Data Warehouse, Data Marts) → Analytics, with Source, ETL, Technical, and Business Metadata alongside.]

Page 22

Data Warehouse Options
There are, perhaps, as many ways to develop data warehouses as there are organizations. A number of key factors need to be considered: scope, data redundancy, and type of end user.
• Scope. The scope of a data warehouse may be as broad as all the informational data for the entire enterprise from the beginning of time, or as narrow as a personal data warehouse for a single manager for a single year.
• Data Redundancy. Virtual data warehouses allow end users to access operational databases directly; they provide the ultimate in flexibility as well as the minimum amount of redundant data that must be loaded and maintained. Central data warehouses are single physical databases that contain all data for a specific functional area, department, division, or enterprise. Distributed data warehouses are those in which certain components are distributed across a number of different physical databases.
• Type of End User. End users can be broadly categorized into three groups: executives and managers; power users (business and financial analysts, engineers); and support users (clerical, administrative).

Page 23

For the Next Session

BAFEDM2: Fundamentals of Enterprise Data Management

Page 24

For the Next Sessions

Agenda
• Module 2: Data Warehouse Design Considerations
  – Data Models
  – The Dimensional Model
  – Facts and Dimensions
  – Four-Step Dimensional Design Process
  – Case Study: Retail