Data Profiling using CA ERwin Modeling to assure data and metadata

Using ca e rwin modeling to asure data 09162010



abstract

• This session explores the use of data profiling to increase the accuracy of critical data assets and their associated data models and metadata. It includes examples of how clients have combined data profiling with data modeling for master data management, data warehousing, data governance, and other data-intensive initiatives.


biography

• Antonio C. Amorin, President, Data Innovations, Inc.

– Eighteen years of data modeling experience and fourteen years of experience using CA ERwin® Data Modeler

– Ten years of data profiling experience and two years of experience using CA ERwin® Data Profiler

– Data Innovations, Inc. – CA Partner since 2004

– Presented at CA World’08, CBI’s Life Sciences Forum on “Customer Data Quality and Integrity”, ERwin User Groups, webcasts and at client sites

– Graduated from Illinois State University with a BA in Computer Science and a minor in Economics


agenda

• Data Profiling

• Data and Metadata Quality

• Data Governance and Data Warehousing

• Real-life Examples

• Summary


data profiling


• What is data profiling?

– The analysis of data content to infer metadata

– A component of data modeling

• What are the basic components of the CA ERwin® Data Profiler?

– Column analysis

– PF key analysis

– Data object analysis

– Overlap analysis


• Column analysis

– Inferred metadata provides intimate knowledge of the data content at the column level

• Cardinality

• Range

• Mode

• Sparse

• Null count
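The column-level indicators listed above can be illustrated with a few lines of Python. This is a minimal sketch, not the CA ERwin® Data Profiler's implementation: the function name `profile_column`, the treatment of empty strings as nulls, and the reading of "sparse" as "null share at or above a threshold" are all assumptions made for illustration.

```python
from collections import Counter


def profile_column(values, sparse_threshold=0.5):
    """Compute basic column-level statistics for a list of raw values.

    Assumption: None and empty strings both count as nulls.
    """
    non_null = [v for v in values if v is not None and v != ""]
    null_count = len(values) - len(non_null)
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": null_count,
        # Cardinality: number of distinct non-null values.
        "cardinality": len(counts),
        # Range: smallest and largest non-null value.
        "range": (min(non_null), max(non_null)) if non_null else None,
        # Mode: most frequent non-null value (first seen wins on ties).
        "mode": counts.most_common(1)[0][0] if counts else None,
        # Sparse (assumed definition): null share meets the threshold.
        "sparse": bool(values) and null_count / len(values) >= sparse_threshold,
    }


# Hypothetical sample column.
state = ["IL", "IL", "WI", None, "IN", "IL", ""]
profile = profile_column(state)
print(profile)
```

A column whose cardinality equals its row count is a key candidate, while a very low cardinality suggests a code or reference value; both are cheap signals for deciding where to look more closely.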


• Column analysis (continued)

• Value frequencies

• Pattern frequencies

• Length frequencies

• Identify critical data elements

– Allows the user to focus analysis on specific attributes
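Value, pattern, and length frequencies are simple counts over a column. The sketch below assumes a common pattern convention (letters become `A`, digits become `9`, everything else is kept); the helper names `pattern_of` and `frequencies` are hypothetical and the profiler's own pattern notation may differ.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: letters -> 'A', digits -> '9'."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)


def frequencies(values):
    """Count values, patterns, and lengths, ignoring None."""
    non_null = [v for v in values if v is not None]
    return {
        "value": Counter(non_null),
        "pattern": Counter(pattern_of(v) for v in non_null),
        "length": Counter(len(str(v)) for v in non_null),
    }


# Hypothetical phone-number column with mixed formatting.
phones = ["309-555-0100", "309-555-0101", "3095550102", "N/A", "309-555-0100"]
freq = frequencies(phones)
print(freq["pattern"].most_common())
print(freq["length"].most_common())
```

Rare patterns and lengths are usually where data quality issues hide: here the unformatted number and the `N/A` placeholder both stand out against the dominant `999-999-9999` shape.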


• PF key analysis

– Cross-table analysis of primary-foreign key relationships

• Column matches

• Classification

– Parent-child

– Reference

– None


• PF key analysis (continued)

– Cross-table analysis of primary-foreign key relationships

• Expressions

– table.column=table.column

• Row hit rate

• Value hit rate

• Selectivity
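The slides name row hit rate, value hit rate, and selectivity without defining them. The sketch below uses one plausible set of definitions, stated here as assumptions: row hit rate is the share of non-null child rows whose key exists in the parent, value hit rate is the share of distinct child key values that exist in the parent, and selectivity is the ratio of distinct values to rows in the parent column. The function name `pf_key_metrics` is hypothetical.

```python
def pf_key_metrics(parent_keys, child_keys):
    """Evaluate a candidate expression parent.key = child.key.

    All three definitions below are assumptions for illustration,
    not the documented formulas of any specific profiling tool.
    """
    parent_set = set(parent_keys)
    child_non_null = [k for k in child_keys if k is not None]
    child_distinct = set(child_non_null)

    row_hits = sum(1 for k in child_non_null if k in parent_set)
    value_hits = len(child_distinct & parent_set)

    return {
        "row_hit_rate": row_hits / len(child_non_null) if child_non_null else 0.0,
        "value_hit_rate": value_hits / len(child_distinct) if child_distinct else 0.0,
        "selectivity": len(parent_set) / len(parent_keys) if parent_keys else 0.0,
    }


# Hypothetical expression: customer.customer_id = orders.customer_id
customers = [1, 2, 3, 4]
orders = [1, 1, 2, 5, None, 3]
metrics = pf_key_metrics(customers, orders)
print(metrics)
```

Under these definitions a selectivity of 1.0 means the parent column is unique, and a row hit rate below 1.0 points to orphaned child rows (customer 5 in this sample).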


• Data objects

– Similar to subject areas

– Groups objects together that contain the same data content

– Based on the parent-child relationships

– Creates an object view of related tables or files
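One way to picture grouping by parent-child relationships is as connected components over the relationship graph. The sketch below takes that reading as an assumption; the slides do not say how the profiler actually forms data objects, and the function name `data_objects` and the sample tables are hypothetical.

```python
from collections import defaultdict


def data_objects(tables, parent_child_pairs):
    """Group tables that are linked, directly or indirectly, by
    parent-child relationships (connected components)."""
    neighbors = defaultdict(set)
    for parent, child in parent_child_pairs:
        neighbors[parent].add(child)
        neighbors[child].add(parent)

    seen, groups = set(), []
    for table in tables:
        if table in seen:
            continue
        # Depth-first walk from this table over its relationships.
        group, stack = set(), [table]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical tables and discovered parent-child pairs.
tables = ["customer", "order", "order_line", "product", "audit_log"]
links = [("customer", "order"), ("order", "order_line"), ("product", "order_line")]
groups = data_objects(tables, links)
print(groups)
```

A table with no relationships, such as `audit_log` here, ends up as its own group, which is itself a useful finding when validating a model.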


• Overlap analysis

– Cross-system analysis that identifies data content overlap

– Data Set Summary

• Provides graphical overview

– Legend identifies color coded data sources

– Allows modeler to visualize data content overlap between data sources


• Overlap analysis (continued)

– Data set overlaps

• Table compares each data source to the other data sources

• Allows comparison of two data sources at a time

• Identifies the number of tables and columns that overlap between each data source


• Overlap analysis (continued)

– Column Summary

• Identifies each column in the primary data source

• Identifies value overlap between data sources

• Allows modeler to use critical data elements to focus analysis

• Allows modeler to drill into analysis to identify data content overlap


• Overlap analysis (continued)

– Matches data preview

• Allows the modeler to view hits or misses

• Identifies specific data content that overlaps or does not overlap between each data source
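The hits and misses described above amount to set operations on the distinct values of two columns. The following is a rough sketch under that assumption; `column_overlap`, the overlap percentage, and the sample sources are illustrative and not taken from the tool.

```python
def column_overlap(primary_values, other_values):
    """Compare distinct non-null values of a primary column with
    the same column in another data source."""
    primary = {v for v in primary_values if v is not None}
    other = {v for v in other_values if v is not None}
    hits = primary & other      # values present in both sources
    misses = primary - other    # values only in the primary source
    return {
        "hits": sorted(hits),
        "misses": sorted(misses),
        "overlap_pct": 100.0 * len(hits) / len(primary) if primary else 0.0,
    }


# Hypothetical state codes from two systems.
crm_states = ["IL", "WI", "IN", "Ill.", None]
billing_states = ["IL", "WI", "IN", "MI"]
result = column_overlap(crm_states, billing_states)
print(result)
```

When the second source is a reference table of valid values, the misses are exactly the values that need correction or standardization.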


data and metadata quality


• Data

– Business data - information utilized to operate the business

• Metadata

– Information generated during the development of IT solutions

– Defines both the business and technical understanding of the data

– Utilized to store, process, and report on business data


• Data quality

– Accuracy of the business data

– High/low quality

– Mission critical

• Metadata quality

– Properly represents data content

– Validate parent-child relationships


• Leveraging data profiling

– Use the cardinality, range, mode, and sparse indicators to identify attributes requiring detailed analysis

– Identify data quality issues and validate data types using the value and pattern frequencies

– Leverage the null count and length frequencies to validate column metadata

– Validate parent-child relationships using the primary-foreign key analysis

– Leverage the overlap analysis with reference tables containing valid values for data quality assessments
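Validating column metadata with the null count and length frequencies can be pictured as comparing what the model declares with what the data shows. This is a minimal sketch under stated assumptions: the function `check_column_metadata`, its parameters, and the message wording are hypothetical, and the declared values would in practice come from the data model.

```python
def check_column_metadata(values, declared_nullable, declared_max_length):
    """Compare declared nullability and length with observed data."""
    issues = []
    null_count = sum(1 for v in values if v is None)
    longest = max((len(str(v)) for v in values if v is not None), default=0)

    if null_count and not declared_nullable:
        issues.append(f"{null_count} null(s) found in a NOT NULL column")
    if null_count == 0 and declared_nullable:
        issues.append("no nulls observed; column may be a NOT NULL candidate")
    if longest > declared_max_length:
        issues.append(
            f"observed length {longest} exceeds declared {declared_max_length}"
        )
    return issues


# Hypothetical ZIP code column modeled as CHAR(5) NOT NULL.
zip_codes = ["61761", "61701-1234", None, "60601"]
issues = check_column_metadata(zip_codes, declared_nullable=False, declared_max_length=5)
for issue in issues:
    print(issue)
```

Each reported issue is either a data quality problem to fix at the source or a sign that the model metadata needs to change; the profile alone cannot say which.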


data governance and data warehousing


Leveraging data profiling for data governance

• Business data

– Standards

– Master data management

– Data quality assessments

• Metadata

– Standards

– Model validation


Leveraging data profiling for data governance (continued)

• Standards

– Business data - valid values, data patterns, and standardized values for static data content

– Metadata - validate that model metadata represents data content properly and validate parent-child relationships

– Automate the analysis with profiling

– Develop profiling reports for each standard

– Define and implement a review process

– Integrate standards and review process into SDLC


Leveraging data profiling for data governance (continued)

• Master data management (MDM)

– Locating reference data

– Data mapping

– Harmonizing reference data

– Establishing validations and syndication rules

– Identifying hub metadata

– Data quality assessments


Leveraging data profiling for data governance (continued)

• Data quality assessments

– Comprehensive review at the column level

– Validation of primary keys

– Validation of parent-child relationships

– Point-to-point content validation between systems

– Standardize analysis methodology

– Standardize problem notation

– Standardize reporting


Leveraging data profiling for data warehousing

• Data warehouse development

– Leverage data models and data profiling results to locate and map business data to the data warehouse

– Eliminate the code-load-explode development methodology for ETL

• Profile each data source to validate data content

• Identify accurate requirements for transformations to consolidate data content and correct data quality issues

– Use profiling results to determine model metadata for target staging databases and the data warehouse

– Profile the data warehouse regularly to ensure high quality data content


real-life examples


Public computer hardware and software manufacturer

• Introduced data profiling into ongoing data warehousing project

– Profiled first data source

• Found questionable data content in financial data within ten minutes of profiling data

• Realized that six months were wasted mapping from the data source to the target data warehouse

• All new data sources were profiled going forward to ensure validity


Large public food manufacturer

• Introduced data profiling into sales data warehouse project

– Leveraged data profiling results to create accurate ETL specifications, reducing the overall development time

– Developers utilized data profiling to validate ETL unit testing

– Used cross-system analysis to integrate data content from disparate data sources into standardized values in data warehouse

– Profiled data warehouse regularly to identify data content issues


Public healthcare insurance provider

• Introduced data profiling into ongoing master data management project

– Performed data content mapping utilizing profiling results

– Analyzed IMS extracts and flat files to determine where reference data lived within legacy mainframe data sources

– Leveraged profiling results to create ETL specifications

– Harmonized reference data using profiling results

– Validated reference data loaded into MDM hub


Medium-sized accounting service organization

• Created data store for reporting purposes

– Profiled disparate data sources to identify model metadata for new data store

– Leveraged profiling results to identify data quality issues for each data source

– Created ETL specifications to consolidate data content from the disparate data sources using the profiling results

– Validated data content in the loaded data store


summary

• Data Profiling

• Increases accuracy of data content and metadata

• Reduces project overrun

• Increases value of deliverables to the business

• Valuable for master data management, data warehousing, data governance, and other data-intensive initiatives


questions and answers


thank you