Understanding Data Quality Issues:
Finding Data Inaccuracies
Art DeMaio, Evoke Software, VP Technical Sales Support
Agenda
• Why is Understanding Data Important?
• Methodology for Assessing Data
– Defining
– Weighting
– Profiling
– Revisiting
– Finding
– Addressing
– Maintaining
• What is Profiling?
• Benefits of the Assessment
What the Experts say…
• “Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction.”
- Larry P. English
What the Experts say…
• “Poor data quality is costly. It lowers customer satisfaction, adds expense, and makes it more difficult to run a business and pursue tactical improvements such as data warehouses and re-engineering.”
- Thomas C. Redman
What’s in Your DATA…
• “…three-quarters (of participating companies) reported significant problems as a result of defective data, with a third failing to bill or collect receivables as a result.”
- In a PricewaterhouseCoopers survey of 600 CIOs, IT directors or similar executives
What is Data Quality?
• Accuracy of Content
• Structure
• Completeness
• Timeliness
• Presentation
Assessing Your Data
[Diagram: assessment cycle around the Source Data: 1-Define Issues, 2-Weight/Impact, 3-Profile Data, 4-Revisit Definitions and Weights, 5-Findings, 6-Address, 7-Maintain]
Defining Issues
• Standard list
• Key requirements
– Content
– Structure
– Completeness
• Update list by project or source
Defining Issues-sample
• Constants
• Definition Mismatches
• Filler Containing Data
• Inconsistent Cases
• Inconsistent Data Types
• Inconsistent Null Rules
• Invalid Keys
• Invalid Values
• Miscellaneous
• Missing Values
• Orphans
• Out of Range
• Pattern Exceptions
• Potential Constants
• Potential Defaults
• Potential Duplicates
• Potential Invalids
• Potential Redundant Values
• Potential Unused Fields
• Rule Exceptions
• Unused Fields
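As an illustration of how a few of these issue types might be detected, here is a minimal sketch in Python. The function name `profile_column`, the chosen checks, and the sample values are assumptions made for this example; the slides do not describe how any particular tool detects these issues.

```python
import re
from collections import Counter


def profile_column(values, pattern=None):
    """Flag a few of the sample issue types for one column of string values."""
    issues = []
    non_blank = [v for v in values if v is not None and v.strip() != ""]

    # Missing Values: some entries are None or blank.
    if len(non_blank) < len(values):
        issues.append("Missing Values")

    # Constants: every populated entry holds the same value.
    if non_blank and len(set(non_blank)) == 1:
        issues.append("Constants")

    # Inconsistent Cases: the same value appears with different capitalization.
    by_folded = Counter(v.casefold() for v in set(non_blank))
    if any(count > 1 for count in by_folded.values()):
        issues.append("Inconsistent Cases")

    # Pattern Exceptions: populated entries that do not match the expected pattern.
    if pattern is not None and any(not re.fullmatch(pattern, v) for v in non_blank):
        issues.append("Pattern Exceptions")

    return issues


# Hypothetical column of state codes, expected to be two letters each.
states = ["TX", "tx", "CA", "", "Calif"]
print(profile_column(states, pattern=r"[A-Za-z]{2}"))
# prints ['Missing Values', 'Inconsistent Cases', 'Pattern Exceptions']
```

Each check maps to one entry in the issue list, so the list can be extended or trimmed by project or source as the previous slide suggests.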
Weight Impact
• After the issues are initially identified:
– Some issues are more critical than others
– Weights are not priorities
– Assign a weighting factor (1-5)
– Weighting factors SHOULD change by project
Profile Data
• What does Data Profiling mean?
What is Data Profiling?
The use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality.
A process of developing information about data instead of information from data.
Information About Data: (Data Profiling)
• 30% of entries in SUPPLIER_ID are blank
• The range of values in UNIT_PRICE is 5.99 to 4599.99
• There are 14 ORDER_HEADER rows with no ORDER_DETAIL rows
Information FROM Data: (not Data Profiling)
• Texas auto buyers buy more Cadillacs per capita than any other state
• The average mortgage amount increased last year by 6%
• 10% of last year's customers did not buy anything this year
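The three "information about data" statements correspond to simple profiling queries. The sketch below uses Python's built-in `sqlite3` module with a tiny in-memory data set. The table layouts, the placement of SUPPLIER_ID in ORDER_HEADER, the ORDER_ID join column, and the sample rows are all assumptions for illustration, so the results differ from the figures on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDER_HEADER (ORDER_ID INTEGER PRIMARY KEY, SUPPLIER_ID TEXT)")
conn.execute("CREATE TABLE ORDER_DETAIL (ORDER_ID INTEGER, UNIT_PRICE REAL)")

# Hypothetical sample rows: two headers have a blank or NULL SUPPLIER_ID,
# and headers 2 and 3 have no detail rows.
conn.executemany(
    "INSERT INTO ORDER_HEADER VALUES (?, ?)",
    [(1, "S01"), (2, ""), (3, None), (4, "S02")],
)
conn.executemany(
    "INSERT INTO ORDER_DETAIL VALUES (?, ?)",
    [(1, 5.99), (1, 120.00), (4, 4599.99)],
)

# Percentage of SUPPLIER_ID entries that are blank (NULL or empty).
blank_pct = conn.execute(
    "SELECT 100.0 * SUM(SUPPLIER_ID IS NULL OR TRIM(SUPPLIER_ID) = '') / COUNT(*) "
    "FROM ORDER_HEADER"
).fetchone()[0]

# Range of values in UNIT_PRICE.
low, high = conn.execute(
    "SELECT MIN(UNIT_PRICE), MAX(UNIT_PRICE) FROM ORDER_DETAIL"
).fetchone()

# ORDER_HEADER rows with no ORDER_DETAIL rows.
childless = conn.execute(
    "SELECT COUNT(*) FROM ORDER_HEADER h "
    "WHERE NOT EXISTS (SELECT 1 FROM ORDER_DETAIL d WHERE d.ORDER_ID = h.ORDER_ID)"
).fetchone()[0]

print(f"{blank_pct:.0f}% of entries in SUPPLIER_ID are blank")
print(f"the range of values in UNIT_PRICE is {low} to {high}")
print(f"there are {childless} ORDER_HEADER rows with no ORDER_DETAIL rows")
```

None of these queries answers a business question. Each one describes the data itself, which is the distinction the slide draws.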
Profile Data
• This is a multi-step process
– Collect documentation
– Review the DATA itself
– Compare data to documentation
– Identify and detail specific issues
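The "compare data to documentation" step can be sketched as a check of observed rows against documented field rules. In this Python example the `documented` dictionary, the sample rows, and the mapping of each mismatch to an issue type are hypothetical; real documentation (copybooks, DDL, data dictionaries) would carry more rules than type and nullability.

```python
# Hypothetical documentation for two fields: expected type and whether blanks are allowed.
documented = {
    "SUPPLIER_ID": {"type": str, "nullable": False},
    "UNIT_PRICE": {"type": float, "nullable": False},
}

# Hypothetical rows read from the source data.
rows = [
    {"SUPPLIER_ID": "S01", "UNIT_PRICE": 5.99},
    {"SUPPLIER_ID": None, "UNIT_PRICE": 4599.99},
    {"SUPPLIER_ID": "S02", "UNIT_PRICE": "N/A"},
]


def compare_to_documentation(rows, documented):
    """Return (row number, field, issue) tuples where the data contradicts the documentation."""
    findings = []
    for row_number, row in enumerate(rows, start=1):
        for field, spec in documented.items():
            value = row.get(field)
            if value is None:
                # The documentation says the field is required, but the data has a null.
                if not spec["nullable"]:
                    findings.append((row_number, field, "Inconsistent Null Rules"))
            elif not isinstance(value, spec["type"]):
                # The stored value does not have the documented type.
                findings.append((row_number, field, "Inconsistent Data Types"))
    return findings


findings = compare_to_documentation(rows, documented)
for finding in findings:
    print(finding)
# prints (2, 'SUPPLIER_ID', 'Inconsistent Null Rules')
# prints (3, 'UNIT_PRICE', 'Inconsistent Data Types')
```

Recording the row and field with each mismatch covers the last step in the list: identifying and detailing specific issues.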
Revisit
• Review the issues and weights
– Should there be more or fewer issues?
– What are they?
– Is the relative importance of each issue different?
Findings
• Your findings tell others about the data
– Documented reports and/or charts
– Results database
– Quality Assessment Score
Findings-Chart
[Bar chart: Sample Company Issue Findings. X-axis: Issue Category, one bar per issue type from Constant through Unused. Y-axis: Count of Issues, 0 to 25.]
Findings-Chart
Issue Type                   Issues Discovered   Issues Possible
Constants                    1                   59
Definition Mismatches        4                   59
Filler Containing Data       1                   59
Inconsistent Cases           3                   59
Inconsistent Data Types      15                  59
Inconsistent Null Rules      6                   59
Invalid Keys                 1                   3
Invalid Values               1                   59
Miscellaneous                10                  59
Missing Values               18                  59
Orphans                      2                   2
Out of Range                 3                   59
Pattern Exceptions           10                  59
Potential Constants          1                   59
Potential Defaults           1                   59
Potential Duplicates         3                   59
Potential Invalids           4                   59
Potential Redundant Values   21                  59
Potential Unused Fields      1                   59
Rule Exceptions              3                   3
Unused Fields                1                   59
Total                        110                 1070
Raw Score 89.7%
Findings-Chart
Weight Factor   Issue Type                   Issues Discovered   Issues Possible
4               Constants                    1                   59
2               Definition Mismatches        4                   59
3               Filler Containing Data       1                   59
1               Inconsistent Cases           3                   59
2               Inconsistent Data Types      15                  59
3               Inconsistent Null Rules      6                   59
5               Invalid Keys                 1                   3
5               Invalid Values               1                   59
1               Miscellaneous                10                  59
3               Missing Values               18                  59
4               Orphans                      2                   2
5               Out of Range                 3                   59
4               Pattern Exceptions           10                  59
2               Potential Constants          1                   59
2               Potential Defaults           1                   59
1               Potential Duplicates         3                   59
3               Potential Invalids           4                   59
4               Potential Redundant Values   21                  59
3               Potential Unused Fields      1                   59
5               Rule Exceptions              3                   3
4               Unused Fields                1                   59
                Total                        110                 1070
Weighted Score 76.2%
Findings-Chart
Weight Factor                        5        4        3        2        1
Issues identified in weight factor   8        35       30       21       16
Average rate per factor              35.03%   31.19%   10.17%   8.90%    9.04%
Total Average by weight              175.1%   124.7%   30.5%    17.8%    9.0%
Weighted Issue Rate - 23.8%
Weighted Assessment Score - 76.2%
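The slides report the scores without stating the formulas. The Python sketch below is a reconstruction that reproduces the reported figures from the findings tables: the raw score is one minus total issues discovered over total issues possible, and the weighted issue rate averages the per-issue-type rates within each weight factor, multiplies each average by its weight, and divides the sum by the sum of the weights (15). The function names are assumptions made for this example.

```python
from collections import defaultdict

# (weight factor, issue type, issues discovered, issues possible), from the weighted findings table.
FINDINGS = [
    (4, "Constants", 1, 59),
    (2, "Definition Mismatches", 4, 59),
    (3, "Filler Containing Data", 1, 59),
    (1, "Inconsistent Cases", 3, 59),
    (2, "Inconsistent Data Types", 15, 59),
    (3, "Inconsistent Null Rules", 6, 59),
    (5, "Invalid Keys", 1, 3),
    (5, "Invalid Values", 1, 59),
    (1, "Miscellaneous", 10, 59),
    (3, "Missing Values", 18, 59),
    (4, "Orphans", 2, 2),
    (5, "Out of Range", 3, 59),
    (4, "Pattern Exceptions", 10, 59),
    (2, "Potential Constants", 1, 59),
    (2, "Potential Defaults", 1, 59),
    (1, "Potential Duplicates", 3, 59),
    (3, "Potential Invalids", 4, 59),
    (4, "Potential Redundant Values", 21, 59),
    (3, "Potential Unused Fields", 1, 59),
    (5, "Rule Exceptions", 3, 3),
    (4, "Unused Fields", 1, 59),
]


def raw_score(findings):
    """One minus the share of possible issues that were actually discovered."""
    discovered = sum(d for _, _, d, _ in findings)
    possible = sum(p for _, _, _, p in findings)
    return 1 - discovered / possible


def weighted_issue_rate(findings):
    """Weight-averaged issue rate, using the mean per-type rate within each weight factor."""
    rates_by_weight = defaultdict(list)
    for weight, _, discovered, possible in findings:
        rates_by_weight[weight].append(discovered / possible)
    weighted_total = sum(
        weight * (sum(rates) / len(rates)) for weight, rates in rates_by_weight.items()
    )
    return weighted_total / sum(rates_by_weight.keys())


print(f"Raw Score {raw_score(FINDINGS):.1%}")                             # Raw Score 89.7%
print(f"Weighted Issue Rate {weighted_issue_rate(FINDINGS):.1%}")         # Weighted Issue Rate 23.8%
print(f"Weighted Assessment Score {1 - weighted_issue_rate(FINDINGS):.1%}")  # Weighted Assessment Score 76.2%
```

The weighted score falls well below the raw score mainly because Rule Exceptions (3 of 3) and Invalid Keys (1 of 3) carry the highest weight and have few possible occurrences, so their rates dominate the weight-5 average.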
Address the Issues
• Addressing your findings
– Actual vs. Potential
– Subject Matter Expertise
– Cleansing Requirements
Maintain Vigilance
• Maintain
– Complete the cycle
– Periodic review
– Document score changes
Why Do The Assessment?
• Quantify the quality issues
• Isolate true problems
• Proactive review
– reduces the cost of resolving issues
– reduces the risk of customer dissatisfaction
• Define the scope of issues
• Determine the resources required to address issues
Why Do The Assessment?
[Chart: Cost to Address an Issue (y-axis) plotted against When You Find an Issue along the Project Timeline (x-axis: TIME), shown alongside Project Costs]
Pay me now or Pay me later
When Should It Be Done?
• Every IT data project
– Warehousing
– CRM
– ERP
– EAI
– M&A
• Ongoing based on
– Criticality of the system
– Current status (score)
– Need to re-purpose data
Bibliography
Larry P. English, Improving Data Warehouse and Business Information Quality, John Wiley & Sons Inc., 1999
Jack Olson, Data Profiling: The Accuracy Dimension, Morgan Kaufmann, 2002
Thomas C. Redman, Data Quality for the Information Age, Artech House, 1996
PricewaterhouseCoopers, “Global Data Management Survey”, 2001