Produced by: Data Blueprint, 10124-C W. Broad St, Glen Allen, VA 23060
Classification: Education
Date: 10/09/12, 10/04/12
© Copyright this and previous years by Data Blueprint - all rights reserved!
Data Quality Engineering
Date: October 9, 2012
Time: 2:00 PM ET
Presented by: Dr. Peter Aiken
This presentation provides guidance to organizations considering data quality initiatives or preparing for data quality initiatives. This talk will illustrate how organizations with chronic business challenges often can trace the root of the problem to poor data quality. Showing how data quality can be engineered provides a useful framework in which to develop an organizational approach. This in turn will allow organizations to more quickly identify data problems caused by structural issues versus practice-oriented defects. Participants will also learn the importance of practicing data quality engineering quantification.
[Figure: data life cycle model. Recoverable labels follow.]
Starting points: new system development; existing systems
Store: Metadata & Data Storage
Flows: data performance metadata; data architecture; data architecture and data models; shared data; updated data; corrected data; architecture refinements; facts & meanings
Phases:
• Metadata Creation: Define Data Architecture; Define Data Model Structures
• Metadata Structuring: Implement Data Model Views; Populate Data Model Views
• Metadata Refinement: Correct Structural Defects; Update Implementation
• Data Creation: Create Data; Verify Data Values
• Data Utilization: Inspect Data; Present Data
• Data Manipulation: Manipulate Data; Update Data
• Data Assessment: Assess Data Values; Assess Metadata
• Data Refinement: Correct Data Value Defects; Re-store Data Values
Get Social With Us!
Live Twitter Feed: join the conversation!
• Follow us: @datablueprint, @paiken
• Ask questions and submit your comments: #dataed
Like Us on Facebook: www.facebook.com/datablueprint
• Post questions and comments
• Find industry news, insightful content and event updates
Join the Group: Data Management & Business Intelligence
• Ask questions, gain insights and collaborate with fellow data management professionals
Meet Your Presenter: Dr. Peter Aiken
• Internationally recognized thought-leader in the data management field, with 30 years of experience
  – Recipient of multiple international awards
  – Founder, Data Blueprint (http://datablueprint.com)
• 7 books and dozens of articles
• Experienced w/ 500+ data management practices in 20 countries
• Multi-year immersions with organizations as diverse as the US DoD, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia and Walmart
Outline
1. Data Management Introduction
2. Data Quality Definitions & Overview
3. DQM Cycle
4. DQ Awareness & Requirements
5. DQ Dimensions
6. Data Quality Tools
7. Guiding Principles
8. References and Q&A
Tweeting now: #dataed
The DAMA Guide to the Data Management Body of Knowledge
Data Management Functions
• Published by DAMA International
  – The professional association for Data Managers (40 chapters worldwide)
• DMBoK organized around
  – Primary data management functions focused around data delivery to the organization
  – Several environmental elements
The DAMA Guide to the Data Management Body of Knowledge
Environmental Elements
Amazon: http://www.amazon.com/DAMA-Guide-Management-Knowledge-DAMA-DMBOK/dp/0977140083
Or enter the terms "dama dm bok" at the Amazon search engine
What is the CDMP?
• Certified Data Management Professional
• DAMA International and ICCP
• Membership in a distinct group made up of your fellow professionals
• Recognition for your specialized knowledge in a choice of 17 specialty areas
• Series of 3 exams
• For more information, please visit:
  – http://www.dama.org/i4a/pages/index.cfm?pageid=3399
  – http://iccp.org/certification/designations/cdmp
Data Management
• Data Program Coordination: manage data coherently.
• Organizational Data Integration: share data across boundaries.
• Data Stewardship: assign responsibilities for data.
• Data Development: engineer data delivery systems.
• Data Support Operations: maintain data availability.
Overview: Data Quality Engineering
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Definitions
Data Quality Management
• Planning, implementation and control activities that apply quality management techniques to measure, assess, improve, and ensure the fitness of data for use
• “Entails the establishment and deployment of roles, responsibilities concerning the acquisition, maintenance, dissemination, and disposition of data.” http://www2.sas.com/proceedings/sugi29/098-29.pdf
• Critical support process in organizational change management
• Continuous process for defining the parameters for specifying acceptable levels of data quality to meet business needs and for ensuring that data quality meets these levels
Data Quality
• Synonymous with information quality, since poor data quality results in inaccurate information and poor business performance
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Overview: DQM Concepts and Activities
1) Data Quality Management Approach
2) Develop and promote data quality awareness
3) Define data quality requirements
4) Profile, analyze and assess data quality
5) Define data quality metrics
6) Define data quality business rules
7) Test and validate data quality requirements
8) Set and evaluate data quality service levels
9) Measure and monitor data quality
10) Manage data quality issues
11) Clean and correct data quality defects
12) Design and implement operational DQM procedures
13) Monitor operational DQM procedures and performance
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Concepts and Activities
• Data quality expectations provide the inputs necessary to define the data quality framework:
  – Requirements
  – Inspection policies
  – Measures, and monitors that reflect changes in data quality and performance
• The data quality framework requirements reflect 3 aspects of business data expectations:
  1) A manner to record the expectation in business rules
  2) A way to measure the quality of data within that dimension
  3) An acceptability threshold
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
The DQM Cycle
The general approach to DQM is a version of the Deming cycle. Deming proposes a problem-solving model known as “plan-do-study-act” or “plan-do-check-act”.
The cycle begins by:
1) Identifying data issues that are critical to the achievement of business objectives
2) Defining business requirements for data quality
3) Identifying key data quality dimensions
4) Defining business rules critical to ensuring high quality data
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
The DQM Cycle: (1) Plan
Plan for the assessment of the current state and identification of key metrics for measuring quality
• The data quality team assesses the scope of known issues
• This involves:
  – Determining cost and impact
  – Evaluating alternatives for addressing them
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
The DQM Cycle: (2) Deploy
Deploy processes for measuring and improving the quality of data:
• Data profiling
• Institute inspections and monitors to identify data issues when they occur
• Fix flawed processes that are the root cause of data errors or correct errors downstream
• When it is not possible to correct errors at their source, correct them at their earliest point in the data flow
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
The DQM Cycle: (3) Monitor
Monitor the quality of data as measured against the defined business rules
• If data quality meets defined thresholds for acceptability, the processes are in control and the level of data quality meets the business requirements
• If data quality falls below acceptability thresholds, notify data stewards so they can take action during the next stage
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
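The Monitor step above can be sketched in a few lines of Python. This is only an illustration, not something taken from the DAMA Guide: the function name `monitor`, the `notify_stewards` callback, the 0.98 threshold, and the daily scores are all hypothetical values made up for the example.

```python
from typing import Callable, Iterable, List, Tuple

def monitor(
    measurements: Iterable[Tuple[str, float]],
    threshold: float,
    notify_stewards: Callable[[str], None],
) -> List[str]:
    """Compare each (label, conformance score) pair against the threshold.

    Scores at or above the threshold mean the process is in control.
    Scores below it trigger a notification; their labels are returned.
    """
    breaches = []
    for label, score in measurements:
        if score < threshold:
            notify_stewards(f"{label}: conformance {score:.1%} is below {threshold:.1%}")
            breaches.append(label)
    return breaches

# Hypothetical daily conformance scores for one business rule.
daily_scores = [("2012-10-01", 0.991), ("2012-10-02", 0.962), ("2012-10-03", 0.987)]

alerts: List[str] = []  # stands in for an e-mail or ticketing hook
breaches = monitor(daily_scores, threshold=0.98, notify_stewards=alerts.append)
```

Here the notification hook is just a list's `append`; in practice it would be whatever channel the data stewards actually watch.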
The DQM Cycle: (4) Act
Act to resolve any identified issues to improve data quality and better meet business expectations
• New cycles begin as new data sets come under investigation or as new data quality requirements are identified for existing data sets
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Develop and Promote DQ Awareness
• Promoting data quality awareness is essential to ensure buy-in of necessary stakeholders in the organization
• Ensure that the right people in the organization are aware of the existence of data quality issues
• Awareness increases the chance of success of any DQM program
• Awareness includes:
  – Relating material impacts to data issues
  – Ensuring systematic approaches to regulators
  – Oversight of the quality of organizational data
  – Socializing the concept that data quality problems cannot be solely addressed by technology solutions
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Polling Question #1
Which is not a step to promote data quality awareness?
a) Training on the core concepts of data quality
b) Establish data governance framework for data quality
c) Create a data architecture map
Develop and Promote DQ Awareness: Steps
1) Training on the core concepts of data quality
2) Establish data governance framework for data quality
3) Create a data quality oversight board that has a reporting hierarchy associated with the different data governance roles
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Define DQ Requirements
• Data quality must be understood within the context of ‘fitness for use’
• Data quality requirements are often hidden within defined business policies
• Incremental detailed review and iterative refinement of business policies help to identify those information requirements which become data quality rules
• Steps for incremental detailed review:
  – Identify key data components associated with business policies
  – Determine how identified data assertions affect the business
  – Evaluate how data errors are categorized within a set of data quality dimensions
  – Specify the business rules that measure the occurrence of data errors
  – Provide a means for implementing measurement processes that assess conformance to those business rules
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Data Quality Dimensions
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Profile, Analyze and Assess DQ
Data assessment using 2 different approaches:
1) Bottom-up
2) Top-down
Bottom-up assessment:
• Inspection and evaluation of the data sets
• Highlight potential issues based on the results of automated processes
Top-down assessment:
• Engage business users to document their business processes and the corresponding critical data dependencies
• Understand how their processes consume data and which data elements are critical to the success of the business application
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
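As a rough illustration of the bottom-up approach, the following Python sketch profiles each column of a small record set automatically. The function names and the sample rows are hypothetical; real profiling tools report far more than these four statistics.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: size, missing values, distinct values, most common value."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }

def profile(rows):
    """Profile every column that appears in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}

# Made-up sample data with a few deliberate defects.
rows = [
    {"state": "VA", "age": 34},
    {"state": "VA", "age": None},
    {"state": "va", "age": 212},
    {"state": "", "age": 41},
]
report = profile(rows)
```

The profile only highlights potential issues: two distinct spellings of one state and a missing age are candidates for review, and deciding whether they matter is the job of the top-down conversation with business users.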
Define DQ Metrics
• Metrics development occurs as part of the strategy/design/plan step
• Process for defining data quality metrics:
  1) Select one of the identified critical business impacts
  2) Evaluate the dependent data elements, and the create and update processes associated with that business impact
  3) List any associated data requirements
  4) Specify the associated dimension of data quality and one or more business rules to use to determine conformance of the data to expectations
  5) Describe the process for measuring conformance
  6) Specify an acceptability threshold
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
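One way to make the six steps concrete is to record each metric as a small data structure. The sketch below is an assumption about how that might look in Python; the class `DataQualityMetric`, its field names, the postal-code rule and the 0.98 threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DataQualityMetric:
    business_impact: str          # step 1: the critical business impact
    data_elements: Sequence[str]  # step 2: dependent data elements
    requirement: str              # step 3: associated data requirement
    dimension: str                # step 4: data quality dimension
    rule: Callable[[dict], bool]  # step 4: business rule that decides conformance
    threshold: float              # step 6: acceptability threshold, between 0 and 1

    def measure(self, records: Sequence[dict]) -> float:
        """Step 5: conformance = conforming records / total records."""
        if not records:
            return 1.0
        return sum(1 for r in records if self.rule(r)) / len(records)

    def is_acceptable(self, records: Sequence[dict]) -> bool:
        return self.measure(records) >= self.threshold

def five_digit_postal_code(record: dict) -> bool:
    """Hypothetical business rule: postal_code is exactly five digits."""
    code = str(record.get("postal_code", ""))
    return code.isdigit() and len(code) == 5

metric = DataQualityMetric(
    business_impact="Returned shipments",
    data_elements=["postal_code"],
    requirement="Every order carries a 5-digit postal code",
    dimension="validity",
    rule=five_digit_postal_code,
    threshold=0.98,
)

orders = [
    {"postal_code": "23060"},
    {"postal_code": "2306"},
    {"postal_code": "10001"},
    {"postal_code": "9021O"},  # letter O instead of zero
]
score = metric.measure(orders)
```

Keeping the rule and the threshold next to the business impact they protect makes it easier to explain later why a particular number was chosen.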
Test and Validate DQ Requirements
• Data profiling tools analyze data to find potential anomalies
• Use the same tools for rule validation
• Rules discovered or defined during the data quality assessment phase are referenced in measuring conformance as part of the operational process
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Set and Evaluate DQ Service Levels
• Data quality inspection and monitoring are used to measure and monitor compliance with defined data quality rules
• Data quality SLAs specify the organization’s expectations for response and remediation
• Operational data quality control defined in data quality SLAs includes:
  – Data elements covered by the agreement
  – Business impacts associated with data flaws
  – Data quality dimensions associated with each data element
  – Quality expectations for each data element of the identified dimensions in each application or system in the value chain
  – Methods for measuring against those expectations
  – (…)
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Measure and Monitor DQ
• DQM procedures depend on available data quality measuring and monitoring services
• 2 contexts for control/measurement of conformance to data quality business rules exist:
  – In-stream: collect in-stream measurements while creating data
  – In batch: perform batch activities on collections of data instances assembled in a data set
• Apply measurements at 3 levels of granularity:
  – Data element value
  – Data instance or record
  – Data set
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
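The three levels of granularity and the two measurement contexts can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the order fields, the five-digit postal code rule, the date rule, and the uniqueness rule.

```python
import re

POSTAL_CODE = re.compile(r"\d{5}")

def check_value(value) -> bool:
    """Data element value level: is this one postal code well formed?"""
    return isinstance(value, str) and POSTAL_CODE.fullmatch(value) is not None

def check_record(record: dict) -> bool:
    """Record level: a cross-field rule inside one data instance.

    Dates are ISO strings, so plain string comparison orders them correctly.
    """
    dates_ok = record.get("ship_date", "") >= record.get("order_date", "")
    return check_value(record.get("postal_code")) and dates_ok

def check_data_set(records: list) -> bool:
    """Data set level: a rule over the whole collection (order ids are unique)."""
    ids = [r.get("order_id") for r in records]
    return len(ids) == len(set(ids))

orders = [
    {"order_id": 1, "postal_code": "23060", "order_date": "2012-10-01", "ship_date": "2012-10-03"},
    {"order_id": 2, "postal_code": "2306", "order_date": "2012-10-02", "ship_date": "2012-10-04"},
    {"order_id": 2, "postal_code": "10001", "order_date": "2012-10-05", "ship_date": "2012-10-04"},
]

in_stream = [check_record(o) for o in orders]  # evaluated as each record is created
in_batch = check_data_set(orders)              # evaluated once over the assembled set
```

The duplicate order id is invisible to the value-level and record-level checks; only the batch check over the whole data set can see it.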
Manage DQ Issues
• Supporting the enforcement of the data quality SLA requires a mechanism for reporting and tracking data quality incidents and activities for researching and resolving those incidents
• A data quality incident reporting system can provide this capability
• It can log the evaluation, initial diagnosis, and actions associated with data quality events
Clean & Correct DQ Defects
Perform data correction in 3 ways:
1) Automated correction
2) Manual directed correction
3) Manual correction
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
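A data quality incident log of the kind described above could be as simple as the following Python sketch. It is a toy model under stated assumptions: the class names `Incident` and `IncidentLog`, the field names, and the example incident are hypothetical, and a real system would add timestamps, owners and SLA deadlines.

```python
from dataclasses import dataclass, field
from enum import Enum

class Correction(Enum):
    """The three ways of correcting data named on the slide."""
    AUTOMATED = "automated correction"
    MANUAL_DIRECTED = "manual directed correction"
    MANUAL = "manual correction"

@dataclass
class Incident:
    description: str
    evaluation: str = ""
    diagnosis: str = ""
    actions: list = field(default_factory=list)  # (Correction, note) pairs
    resolved: bool = False

class IncidentLog:
    def __init__(self):
        self.incidents = []

    def report(self, description: str) -> Incident:
        """Log a new incident and return it so details can be filled in."""
        incident = Incident(description)
        self.incidents.append(incident)
        return incident

    def open_incidents(self):
        return [i for i in self.incidents if not i.resolved]

log = IncidentLog()
incident = log.report("Postal codes with 4 digits in the orders feed")
incident.evaluation = "Some orders cannot be routed"
incident.diagnosis = "Leading zero dropped by an upstream export"
incident.actions.append((Correction.AUTOMATED, "Left-pad postal codes to 5 digits"))
incident.resolved = True
```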
Manage DQ Issues: Example
Data quality incident tracking focuses on training staff to recognize when data issues appear and how they are to be classified, logged and tracked according to the data quality SLA
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Design and Implement Operational DQM Procedures
1) Inspection and monitoring
2) Diagnosis and evaluation of remediation alternatives
3) Resolve issues
4) Reporting
Monitor Operational DQM Procedures and Performance
1) Accountability is critical to governance protocols overseeing data quality control
2) All issues must be assigned
3) The tracking process should specify and document the ultimate issue accountability
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Example: Data Quality Interview Session Summary
• During mid-February, the Data Governance Team and Data Blueprint conducted ten qualitative interview sessions with groups of individuals who interact with data on a regular basis
• A series of patterns emerged as participants shared stories about the impact of poor data quality on the client, its products, and its customers
• These patterns highlight gaps in best practices for ensuring data quality, i.e. the extent to which data is “fit for use”
• Our preliminary analysis evaluated these stories against attributes of four data quality dimensions
• At this early stage of the post-interview process, we are seeking confirmation of our assumptions and method
Which Activities Support Quality Data?
• Data quality best practices depend on both
  – Practice-oriented activities
  – Structure-oriented activities
• Practice-oriented activities focus on the capture and manipulation of data
• Structure-oriented activities focus on the data implementation
Quality Dimensions
Practice-oriented causes
• Stem from a failure to apply rigor when capturing and manipulating data. Examples of such rigor:
  – Edit masking
  – Range checking of input data
  – CRC-checking of transmitted data
Structure-oriented causes
• Occur because of data and metadata that have been arranged imperfectly. For example:
  – When the data is in the system but we just can't access it;
  – When a correct data value is provided as the wrong response to a query; or
  – When data is not provided because it is unavailable or inaccessible to the customer
• Developer focus within system boundaries instead of within organization boundaries
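The three practice-oriented controls named above (edit masking, range checking, and CRC checking) are easy to show in code. The sketch below uses only the Python standard library; the helper names, the five-digit mask, the 0 to 120 age range, and the sample payload are assumptions chosen for the example.

```python
import re
import zlib

def matches_mask(value: str, pattern: str) -> bool:
    """Edit masking: accept input only if it fits the expected shape."""
    return re.fullmatch(pattern, value) is not None

def in_range(value: float, low: float, high: float) -> bool:
    """Range checking: accept input only inside plausible bounds."""
    return low <= value <= high

def with_crc(payload: bytes) -> tuple:
    """Sender side: attach a CRC-32 checksum to the payload."""
    return payload, zlib.crc32(payload)

def crc_ok(payload: bytes, checksum: int) -> bool:
    """Receiver side: recompute the CRC-32 and compare to detect corruption."""
    return zlib.crc32(payload) == checksum

payload, checksum = with_crc(b"order=1;zip=23060")

mask_accepts = matches_mask("23060", r"\d{5}")
mask_rejects = matches_mask("2306O", r"\d{5}")   # letter O instead of zero
age_plausible = in_range(212, 0, 120)
arrived_intact = crc_ok(payload, checksum)
arrived_damaged = crc_ok(b"order=1;zip=23061", checksum)
```

All three checks are cheap to apply at the point of capture or transmission, which is why skipping them shows up later as a practice-oriented defect.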
Practice-Oriented Activities
• Affect the Data Value Quality and Data Representation Quality
• Examples of improper practice-oriented activities:
  – Allowing imprecise or incorrect data to be collected when requirements specify otherwise
  – Presenting data out of sequence
• Typically diagnosed in bottom-up manner: find and fix the resulting problem
• Addressed by imposing more rigorous data-handling governance
[Diagram labels: Practice-oriented activities; Quality of Data Representation; Quality of Data Values]
Structure-Oriented Activities
• Affect the Data Model Quality and Data Architecture Quality
• Examples of improper structure-oriented activities:
  – Providing a correct response but incomplete data to a query because the user did not comprehend the system data structure
  – Costly maintenance of inconsistent data used by redundant systems
• Typically diagnosed in top-down manner: root cause fixes
• Addressed through fundamental data structure governance
[Diagram labels: Structure-oriented activities; Quality of Data Architecture; Quality of Data Models]
4 Dimensions of Data Quality
An organization’s overall data quality is a function of four distinct components, each with its own attributes:
Practice-oriented
• Data Value: the quality of data as stored & maintained in the system
• Data Representation: the quality of representation for stored values; perfect data values stored in a system that are inappropriately represented can be harmful
Structure-oriented
• Data Model: the quality of data logically representing user requirements related to data entities, associated attributes, and their relationships; essential for effective communication among data suppliers and consumers
• Data Architecture: the coordination of data management activities in cross-functional system development and operations
Effective Data Quality Engineering
• Data quality engineering has been focused on operational problem correction
  – Directing attention to practice-oriented data imperfections
• Data quality engineering is more effective when also focused on structure-oriented causes
  – Ensuring the quality of shared data across system boundaries
[Diagram, from closer to the user to closer to the architect:]
• Data Representation Quality: as presented to the user
• Data Value Quality: as maintained in the system
• Data Model Quality: as understood by developers
• Data Architecture Quality: as an organizational asset
Full Set of Data Quality Attributes
• Data Value Quality
• Data Representation Quality
• Data Model Quality
• Data Architecture Quality
[Figure: Extended data life cycle model with metadata sources and uses. Phases shown: Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Utilization, Data Manipulation, Data Assessment, Data Refinement, plus Metadata & Data Storage.]
Goals and Principles
§ To measurably improve the quality of data in relation to defined business expectations
§ To define requirements and specifications for integrating data quality control into the system development life cycle
§ To provide defined processes for measuring, monitoring, and reporting conformance to acceptable levels of data quality
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Activities
• Develop and Promote Data Quality Awareness
• Set and Evaluate Data Quality Service Levels
• Test and Validate Data Quality Requirements
• Profile, Analyze, and Assess Data Quality
• Continuously Measure and Monitor Data Quality
• Monitor Operational DQM Procedures and Performance
• Define Data Quality Business Rules
• Define Data Quality Metrics
• Manage Data Quality Issues
• Clean and Correct Data Quality Defects
• Define Data Quality Requirements
• Design and Implement Operational DQM Procedures
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Primary Deliverables
• Improved Quality Data
• Data Management Operational Analysis
• Data Profiles
• Data Quality Certification Reports
• Data Quality Service Level Agreements
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Roles and Responsibilities
Suppliers:
§ External Sources
§ Regulatory Bodies
§ Business Subject Matter Experts
§ Information Consumers
§ Data Producers
§ Data Architects
§ Data Modelers
§ Data Stewards
Participants:
§ Data Quality Analysts
§ Data Analysts
§ Database Administrators
§ Data Stewards
§ Other Data Professionals
§ DRM Director
§ Data Stewardship Council
Consumers:
§ Data Stewards
§ Data Professionals
§ Other IT Professionals
§ Knowledge Workers
§ Managers and Executives
§ Customers
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Polling Question #2
What is one guiding principle for data quality?
a. Business process owners will agree to and abide by data quality SLAs
b. Identify a blue record for all data elements
c. Upstream data consumers specify data quality expectations
Outline
1. Data Management Introduction
2. Data Quality Definitions & Overview
3. DQM Cycle
4. DQ Awareness & Requirements
5. DQ Dimensions
6. Data Quality Tools
7. Guiding Principles
8. References and Q&A
Tweeting now: #dataed
Technology
• Data Profiling Tools
• Statistical Analysis Tools
• Data Cleansing Tools
• Data Integration Tools
• Issue and Event Management Tools
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Overview: Data Quality Tools
4 categories of activities:
1) Analysis
2) Cleansing
3) Enhancement
4) Monitoring
Principal tools:
1) Data Profiling
2) Parsing and Standardization
3) Data Transformation
4) Identity Resolution and Matching
5) Enhancement
6) Reporting
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
DQ Tool #1: Data Profiling
• Data profiling is the assessment of value distribution and clustering of values into domains
• Need to be able to distinguish between good and bad data before making any improvements
• Data profiling is a set of algorithms for 2 purposes:
– Statistical analysis and assessment of the quality of data values within a data set
– Exploring relationships that exist between value collections within and across data sets
• At its most advanced, data profiling takes a series of prescribed rules from data quality engines. It then assesses the data, annotating and tracking violations to determine whether they comprise new or inferred data quality rules
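As a rough illustration of the statistical side of profiling, the sketch below computes a value distribution and a simple "shape" pattern for one column. The function names, the sample ZIP code values, and the choice of statistics are assumptions made for this example; commercial profiling tools do considerably more.

```python
from collections import Counter
import re


def value_pattern(value: str) -> str:
    """Abstract a value into a shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def profile_column(values):
    """Return basic profile statistics for one column of raw values."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Most frequent values hint at the column's real domain.
        "top_values": Counter(non_null).most_common(3),
        # Rare shapes are candidates for "bad data" review.
        "patterns": Counter(value_pattern(str(v)) for v in non_null).most_common(),
    }


# Hypothetical sample column; values are made up for illustration.
zip_codes = ["23060", "23060", "2306", None, "23233-1234", "VA230", ""]
profile = profile_column(zip_codes)
for name, stat in profile.items():
    print(name, stat)
```

The dominant pattern suggests what a "good" value looks like, and the outlying patterns show where to look before attempting any improvement.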
DQ Tool #1: Data Profiling, cont’d
• Data profiling vs. data quality: business context and semantic/logical layers
– Data quality is concerned with proscriptive rules
– Data profiling looks for patterns when rules are adhered to and when rules are violated; able to provide input into the business context layer
• It is incumbent on data profiling services to notify all concerned parties of whatever is discovered
• Profiling can be used to…
– …notify the help desk that valid changes in the data are about to cause an avalanche of “skeptical user” calls
– …notify business analysts of precisely where they should be working today in terms of shifts in the data
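One way to picture "shifts in the data" is to compare the value distribution of a new load against a baseline. The sketch below uses total variation distance for that comparison; the metric, the 0.2 threshold, and all names are assumptions for illustration, not something the slides prescribe.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def distribution_shift(baseline, current):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    p, q = distribution(baseline), distribution(current)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def needs_notification(baseline, current, threshold=0.2):
    """True when the shift is large enough to warn the help desk or analysts."""
    return distribution_shift(baseline, current) > threshold


# Hypothetical status codes from last month's and this month's loads.
last_month = ["A"] * 8 + ["B"] * 2
this_month = ["A"] * 4 + ["B"] * 6
print(distribution_shift(last_month, this_month))
print(needs_notification(last_month, this_month))
```

A check like this says nothing about whether the change is valid; it only tells interested parties where to look.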
DQ Tool #2: Parsing & Standardization
• Data parsing tools enable the definition of patterns that feed into a rules engine used to distinguish between valid and invalid data values
• Actions are triggered upon matching a specific pattern
• When an invalid pattern is recognized, the application may attempt to transform the invalid value into one that meets expectations
• Data standardization is the process of conforming to a set of business rules and formats that are set up by data stewards and administrators
• Data standardization example:
– Bringing all the different formats of “street” into a single format, e.g. “STR”, “ST.”, “STRT”, “STREET”, etc.
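The parsing and standardization ideas above can be sketched with a few regular expressions. The ZIP code patterns, the repair rule, and the choice of "ST" as the standard form are all assumptions for this example; in practice data stewards would define them.

```python
import re

# Hypothetical pattern for a valid value: 5-digit ZIP or ZIP+4.
VALID_ZIP = re.compile(r"^\d{5}(-\d{4})?$")
# Hypothetical invalid-but-repairable pattern: nine digits with no hyphen.
ZIP9_NO_HYPHEN = re.compile(r"^(\d{5})(\d{4})$")


def parse_zip(value):
    """Classify a value and, where a known invalid pattern matches, repair it."""
    value = value.strip()
    if VALID_ZIP.match(value):
        return value, "valid"
    match = ZIP9_NO_HYPHEN.match(value)
    if match:
        return f"{match.group(1)}-{match.group(2)}", "repaired"
    return value, "invalid"


# Variants of "street" taken from the slide, plus the bare "ST".
STREET_VARIANTS = {"STR", "ST", "ST.", "STRT", "STREET"}


def standardize_street(address, standard="ST"):
    """Replace every known street variant with one standard token."""
    tokens = address.upper().split()
    return " ".join(standard if t in STREET_VARIANTS else t for t in tokens)


print(parse_zip("232331234"))
print(standardize_street("12 Main Street"))
```

A token-level rule like this is deliberately naive: "ST" can also abbreviate "Saint", so real standardization rules need positional or contextual checks.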
DQ Tool #3: Data Transformation
• Upon identification of data errors, trigger data rules to transform the flawed data
• Perform standardization and guide rule-based transformations by mapping data values in their original formats and patterns into a target representation
• Parsed components of a pattern are subjected to rearrangement, corrections, or any changes as directed by the rules in the knowledge base
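A minimal sketch of rule-based transformation: each rule in a small "knowledge base" pairs a source pattern with a template that rearranges the parsed components into a target representation. The date formats, the ISO-style target, and the names below are assumptions chosen for illustration.

```python
import re

# Hypothetical knowledge base: (source pattern, target template) pairs.
DATE_RULES = [
    # 10/9/2012 -> 2012-10-09 (components rearranged and zero-padded)
    (re.compile(r"^(?P<m>\d{1,2})/(?P<d>\d{1,2})/(?P<y>\d{4})$"),
     "{y}-{m:0>2}-{d:0>2}"),
    # 20121009 -> 2012-10-09 (separators inserted)
    (re.compile(r"^(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})$"),
     "{y}-{m}-{d}"),
]


def transform(value, rules):
    """Apply the first matching rule; return None when no rule applies."""
    candidate = value.strip()
    for pattern, template in rules:
        match = pattern.match(candidate)
        if match:
            return template.format(**match.groupdict())
    return None


for raw in ["10/9/2012", "20121009", "Oct 9"]:
    print(raw, "->", transform(raw, DATE_RULES))
```

This sketch only reshapes values; it does not check that the result is a real calendar date, which would be a separate validation rule.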
DQ Tool #4: Identity Resolution & Matching
• Data matching enables analysts to identify relationships between records for de-duplication or group-based processing
• Matching is central to maintaining data consistency and integrity throughout the enterprise
• The matching process should be used in the initial migration of data into a single repository
2 basic approaches to matching:
• Deterministic
– Relies on defined patterns/rules for assigning weights and scores to determine similarity
– Predictable
– Dependent on the rule developers' anticipations
• Probabilistic
– Relies on statistical techniques for assessing the probability that any pair of records represents the same entity
– Not reliant on rules
– Probabilities can be refined based on experience -> matchers can improve precision as more data is analyzed
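The two approaches can be contrasted in a short sketch. The deterministic scorer sums hand-assigned weights for agreeing fields. The probabilistic scorer uses a Fellegi-Sunter style log-likelihood ratio, which is one common statistical technique and is an assumption here, since the slide does not name a method. All field names, weights, and probabilities are invented for the example.

```python
import math


def deterministic_score(a, b, weights):
    """Sum the rule-assigned weights of fields on which both records agree."""
    return sum(w for field, w in weights.items() if a.get(field) == b.get(field))


def probabilistic_score(a, b, m_u):
    """Sum of log2 likelihood ratios per field.

    m = P(field agrees | same entity), u = P(field agrees | different entities).
    In practice m and u are estimated from data and refined over time.
    """
    total = 0.0
    for field, (m, u) in m_u.items():
        if a.get(field) == b.get(field):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Hypothetical records and parameters.
rec_a = {"last_name": "SMITH", "zip": "23060", "birth_year": 1970}
rec_b = {"last_name": "SMITH", "zip": "23233", "birth_year": 1970}
rec_c = {"last_name": "JONES", "zip": "90210", "birth_year": 1985}

WEIGHTS = {"last_name": 5, "zip": 2, "birth_year": 3}
M_U = {"last_name": (0.95, 0.01), "zip": (0.90, 0.05), "birth_year": (0.98, 0.02)}

print(deterministic_score(rec_a, rec_b, WEIGHTS))  # compare against a rule threshold
print(probabilistic_score(rec_a, rec_b, M_U))      # positive favors "same entity"
print(probabilistic_score(rec_a, rec_c, M_U))      # negative favors "different"
```

The deterministic score is only as good as the weights its developers anticipated, while the probabilistic score changes as the m and u estimates are updated from experience.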
DQ Tool #5: Enhancement
Definition:
• A method for adding value to information by accumulating additional information about a base set of entities and then merging all the sets of information to provide a focused view. Improves master data.
Benefits:
• Enables use of third party data sources
• Allows you to take advantage of the information and research carried out by external data vendors to make data more meaningful and useful
Examples of data enhancements:
• Time/date stamps
• Auditing information
• Contextual information
• Geographic information
• Demographic information
• Psychographic information
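A small sketch of enhancement as a merge: base records are joined to a third-party reference set on a shared key, and a timestamp plus the source name are added as audit information. The record layout, field names, and the "vendor_geo" source label are assumptions for illustration.

```python
from datetime import datetime, timezone


def enhance(base_records, reference, key, source_name):
    """Merge reference attributes onto each base record and stamp the result."""
    stamp = datetime.now(timezone.utc).isoformat()
    enhanced = []
    for record in base_records:
        extra = reference.get(record[key], {})
        # Base values win when both sides define the same attribute.
        merged = {**extra, **record}
        merged["enhanced_at"] = stamp
        merged["enhancement_source"] = source_name if extra else None
        enhanced.append(merged)
    return enhanced


# Hypothetical base entities and a hypothetical geographic reference set.
customers = [{"id": 1, "zip": "23060"}, {"id": 2, "zip": "99999"}]
geo_by_zip = {"23060": {"city": "Glen Allen", "state": "VA"}}

for row in enhance(customers, geo_by_zip, key="zip", source_name="vendor_geo"):
    print(row)
```

Records with no match in the reference set pass through unchanged apart from the timestamp, which keeps the enhancement step auditable.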
DQ Tool #6: Reporting
• Good reporting supports:
– Inspection and monitoring of conformance to data quality expectations
– Monitoring performance of data stewards conforming to data quality SLAs
– Workflow processing for data quality incidents
– Manual oversight of data cleansing and correction
• Data quality tools provide dynamic reporting and monitoring capabilities
• They enable analysts and data stewards to support and drive the methodology for ongoing DQM and improvement with a single, easy-to-use solution
• Associate report results with:
– Data quality measurement
– Metrics
– Activity
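Conformance reporting can be pictured as comparing measured data quality levels against SLA thresholds. In the sketch below the metric names, the measured values, and the thresholds are invented; a real report would draw them from the organization's own measurements and agreements.

```python
def conformance_report(measurements, sla_thresholds):
    """Return one (metric, measured, required, status) row per SLA metric."""
    rows = []
    for metric in sorted(sla_thresholds):
        required = sla_thresholds[metric]
        measured = measurements.get(metric)
        if measured is None:
            status = "NOT MEASURED"
        elif measured >= required:
            status = "OK"
        else:
            status = "VIOLATION"
        rows.append((metric, measured, required, status))
    return rows


def format_report(rows):
    """Render the rows as fixed-width text for a steward or SLA manager."""
    lines = [f"{'metric':<14}{'measured':>10}{'required':>10}  status"]
    for metric, measured, required, status in rows:
        shown = "n/a" if measured is None else f"{measured:.1%}"
        lines.append(f"{metric:<14}{shown:>10}{required:>10.1%}  {status}")
    return "\n".join(lines)


# Hypothetical measurements and SLA thresholds (fractions of conforming records).
measured = {"completeness": 0.97, "validity": 0.91}
sla = {"completeness": 0.95, "validity": 0.98, "uniqueness": 0.99}
print(format_report(conformance_report(measured, sla)))
```

Rows flagged as violations or as not measured are natural entry points for the incident workflow mentioned above.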
Guiding Principles
1) Manage data as a core organizational asset
2) Identify a gold record for all data elements
3) All data elements will have a standardized data definition, data type, and acceptable value domain
4) Leverage data governance for the control and performance of DQM
5) Use industry and international data standards whenever possible
6) Downstream data consumers specify data quality expectations
7) Define business rules to assert conformance to data quality expectations
8) Validate data instances and data sets against defined business rules
9) Business process owners will agree to and abide by data quality SLAs
10) Apply data corrections at the original source if possible
11) If it is not possible to correct data at the source, forward data corrections to the owner of the original source. Influence on data brokers to conform to local requirements may be limited
12) Report measured levels of data quality to appropriate data stewards, business process owners, and SLA managers
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
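Principles 7 and 8 (define business rules, then validate data against them) can be sketched as named predicates applied to each record. The rules, the state domain, and the sample records below are hypothetical and stand in for whatever downstream consumers actually specify.

```python
# Hypothetical business rules: rule name -> predicate over one record.
RULES = {
    "name_present": lambda r: bool(r.get("name")),
    "zip_is_5_digits": lambda r: (
        isinstance(r.get("zip"), str) and len(r["zip"]) == 5 and r["zip"].isdigit()
    ),
    "state_in_domain": lambda r: r.get("state") in {"VA", "MD", "NC"},
}


def validate(records, rules):
    """Return (violations, conformance).

    violations: list of (record index, rule name) for every failed check.
    conformance: fraction of records passing each rule.
    """
    violations = []
    passed = {name: 0 for name in rules}
    for index, record in enumerate(records):
        for name, rule in rules.items():
            if rule(record):
                passed[name] += 1
            else:
                violations.append((index, name))
    total = len(records)
    conformance = {n: (c / total if total else 1.0) for n, c in passed.items()}
    return violations, conformance


records = [
    {"name": "Acme", "zip": "23060", "state": "VA"},
    {"name": "", "zip": "2306", "state": "TX"},
]
violations, conformance = validate(records, RULES)
print(violations)
print(conformance)
```

The per-rule conformance figures are the kind of measured levels that principle 12 asks to be reported to stewards and SLA managers.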
Interdependencies - Tools alone cannot do the job!
• Data Quality Tools (Technology)
• Data Cleansing and Prevention (Process)
• Education and Training (People)
Summary: Data Quality Engineering
from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
Recommended Reading
Questions?
It’s your turn! Use the chat feature or Twitter (#dataed) to submit your questions to Peter now.
Upcoming Events
November Webinar: Get the Most Out of Your Tools: Data Management Technologies
November 13, 2012 @ 2:00 PM – 3:30 PM ET (11:00 AM – 12:30 PM PT)
December Webinar: Show Me the Money: The Business Value of Data and ROI
December 11, 2012 @ 2:00 PM – 3:30 PM ET (11:00 AM – 12:30 PM PT)
Sign up here:
• www.datablueprint.com/webinar-schedule
• www.Dataversity.net