Understanding Data Quality Issues: Finding Data Inaccuracies Art DeMaio Evoke Software VP Technical Sales Support

Agenda

• Why is Understanding Data Important
• Methodology for Assessing Data
  – Defining
  – Weighting
  – Profiling
  – Revisiting
  – Finding
  – Addressing
  – Maintaining
• What is Profiling
• Benefits of the Assessment

What the Experts say…

• “Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction.”

- Larry P. English

What the Experts say…

• “Poor data quality is costly. It lowers customer satisfaction, adds expense, and makes it more difficult to run a business and pursue tactical improvements such as data warehouses and re-engineering.”

- Thomas C. Redman

What’s in Your DATA…

• “…three-quarters (of participating companies) reported significant problems as a result of defective data, with a third failing to bill or collect receivables as a result.”

- In a PricewaterhouseCoopers survey of 600 CIOs, IT directors or similar executives

What is Data Quality?

• Accuracy of Content

• Structure

• Completeness

• Timeliness

• Presentation

Assessing Your Data

Source Data

1-Define Issues
2-Weight/Impact
3-Profile Data
4-Revisit Definitions, Weights
5-Findings
6-Address
7-Maintain

Defining Issues

• Standard list
• Key requirements
  • Content
  • Structure
  • Completeness
• Update list by project or source

Defining Issues-sample

Constants
Definition Mismatches
Filler Containing Data
Inconsistent Cases
Inconsistent Data Types
Inconsistent Null Rules
Invalid Keys
Invalid Values
Miscellaneous
Missing Values
Orphans
Out of Range
Pattern Exceptions
Potential Constants
Potential Defaults
Potential Duplicates
Potential Invalids
Potential Redundant Values
Potential Unused Fields
Rule Exceptions
Unused Fields
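
To make a few of these issue types concrete, here is a minimal Python sketch of how three of them (Constants, Inconsistent Cases, Pattern Exceptions) might be detected in a single column. The function names, the sample values, and the ZIP-code pattern are illustrative assumptions, not part of the deck or of any profiling product.

```python
import re


def is_constant(values):
    """'Constants': every non-empty value in the field is identical."""
    present = [v for v in values if v not in (None, "")]
    return len(present) > 0 and len(set(present)) == 1


def has_inconsistent_case(values):
    """'Inconsistent Cases': the same text appears with different casing."""
    variants_by_text = {}
    for v in values:
        if isinstance(v, str):
            variants_by_text.setdefault(v.lower(), set()).add(v)
    return any(len(variants) > 1 for variants in variants_by_text.values())


def pattern_exceptions(values, pattern):
    """'Pattern Exceptions': non-empty values that do not match the expected pattern."""
    regex = re.compile(pattern)
    return [v for v in values if v not in (None, "") and not regex.fullmatch(str(v))]


# Hypothetical column samples.
country = ["US", "US", "US"]
state = ["TX", "tx", "NY"]
zip_code = ["75001", "7500", "10001-1234"]

print(is_constant(country))                              # True
print(has_inconsistent_case(state))                      # True
print(pattern_exceptions(zip_code, r"\d{5}(-\d{4})?"))   # ['7500']
```

Whether a detected constant or odd pattern is a real problem is still a judgment call for someone who knows the data, which is why several entries in the list are labeled "Potential".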

Weight Impact

• After the issues are initially identified:
  • Some issues are more critical than others
  • Weights are not priorities
  • Assign a weighting factor (1-5)
  • Weighting factors SHOULD change by project
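
One possible way to hold weighting factors so that they can change by project is a default table plus per-project overrides. The sketch below is only an illustration: the three default values match the sample weighted chart later in the deck, while the function name and the name-matching project are hypothetical.

```python
# Default weighting factors (1 = low impact, 5 = high impact).
# Only three issue types are shown to keep the sketch short.
DEFAULT_WEIGHTS = {
    "Invalid Keys": 5,
    "Missing Values": 3,
    "Inconsistent Cases": 1,
}


def weights_for_project(overrides):
    """Return the default weights with project-specific overrides applied."""
    weights = dict(DEFAULT_WEIGHTS)
    for issue, factor in overrides.items():
        if not 1 <= factor <= 5:
            raise ValueError(f"weighting factor for {issue!r} must be 1-5, got {factor}")
        weights[issue] = factor
    return weights


# A hypothetical name-matching project, where casing matters more than usual.
matching_weights = weights_for_project({"Inconsistent Cases": 4})
print(matching_weights["Inconsistent Cases"])  # 4
```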

Profile Data

•What does Data Profiling mean?

What is Data Profiling?

The use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality.

A process of developing information about data instead of information from data.

What is Data Profiling?

Information About Data: (Data Profiling)

• 30% of entries in SUPPLIER_ID are blank
• The range of values in UNIT_PRICE is 5.99 to 4599.99
• There are 14 ORDER_HEADER rows with no ORDER_DETAIL rows

Information FROM Data: (not Data Profiling)

• Texas auto buyers buy more Cadillacs per capita than any other state
• The average mortgage amount increased last year by 6%
• 10% of last year's customers did not buy anything this year
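
The three "information about data" statements correspond to three simple measurements: a blank rate, a value range, and a parent-without-child check. A minimal Python sketch follows, assuming rows are held as dictionaries; the helper names and the tiny sample tables are made up for illustration, and ORDER_ID is an assumed join key that the slide does not name.

```python
def blank_rate(rows, column):
    """Fraction of rows whose value in `column` is missing or empty."""
    blanks = sum(1 for r in rows if r.get(column) in (None, ""))
    return blanks / len(rows) if rows else 0.0


def value_range(rows, column):
    """(min, max) over the non-missing values in `column`, or None if there are none."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return (min(values), max(values)) if values else None


def parents_without_children(parents, children, key):
    """Parent rows whose key never appears in the child table."""
    child_keys = {c[key] for c in children}
    return [p for p in parents if p[key] not in child_keys]


# Hypothetical sample data.
order_header = [{"ORDER_ID": 1}, {"ORDER_ID": 2}, {"ORDER_ID": 3}]
order_detail = [
    {"ORDER_ID": 1, "SUPPLIER_ID": "S1", "UNIT_PRICE": 5.99},
    {"ORDER_ID": 1, "SUPPLIER_ID": "", "UNIT_PRICE": 4599.99},
    {"ORDER_ID": 3, "SUPPLIER_ID": "S2", "UNIT_PRICE": 20.00},
]

print(f"{blank_rate(order_detail, 'SUPPLIER_ID'):.0%} of SUPPLIER_ID entries are blank")
print("UNIT_PRICE range:", value_range(order_detail, "UNIT_PRICE"))
print("Headers with no detail rows:",
      parents_without_children(order_header, order_detail, "ORDER_ID"))
```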

Profile Data

• This is a multi-step process
  • Collect documentation
  • Review the DATA itself
  • Compare data to documentation
  • Identify and detail specific issues
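
The "compare data to documentation" step can be pictured as checking each value against what the documentation claims about its field. The sketch below assumes the documentation has been captured as a small dictionary of expected type and null rule per column; that structure, the function name, and the mapping of mismatches onto the deck's issue names are all assumptions made for illustration.

```python
# Hypothetical documentation for two fields: expected type and whether nulls are allowed.
DOCUMENTED = {
    "UNIT_PRICE": {"type": float, "nullable": False},
    "SUPPLIER_ID": {"type": str, "nullable": True},
}


def compare_to_documentation(rows, documented):
    """Return (column, row_index, issue) for each value that contradicts the documentation."""
    issues = []
    for i, row in enumerate(rows):
        for column, spec in documented.items():
            value = row.get(column)
            if value is None:
                if not spec["nullable"]:
                    issues.append((column, i, "Inconsistent Null Rules"))
            elif not isinstance(value, spec["type"]):
                issues.append((column, i, "Inconsistent Data Types"))
    return issues


# Hypothetical rows: one clean, one with an undocumented null, one with a price stored as text.
rows = [
    {"UNIT_PRICE": 5.99, "SUPPLIER_ID": "S1"},
    {"UNIT_PRICE": None, "SUPPLIER_ID": None},
    {"UNIT_PRICE": "12.50", "SUPPLIER_ID": "S2"},
]

for column, index, issue in compare_to_documentation(rows, DOCUMENTED):
    print(f"row {index}, {column}: {issue}")
```

Each tuple returned is one specific, detailed issue of the kind the last step in the list asks for.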

Revisit

• Review the issues and weights
  • Should there be more or fewer issues?
    • What are they?
  • Is the relative importance of each issue different?

Findings

• Your findings tell others about the data
  • Documented reports and/or charts
  • Results database
  • Quality Assessment Score

Findings-Chart

Sample Company Issue Findings

[Bar chart: Count of Issues (axis 0 to 25) by Issue Category. Categories: Constant, Definition Mismatch, Filler Containing Data, Inconsistent Case, Inconsistent Data Type, Inconsistent Null Rule, Invalid Keys, Invalid Values, Miscellaneous, Missing Values, Orphans, Out of Range, Pattern Exception, Potential Constant, Potential Default, Potential Duplicates, Potential Invalid, Potential Redundant, Potential Unused, Rule Exceptions, Unused]

Findings-Chart

Issue Type                  Issues Discovered   Possible Issues
Constants                                   1                59
Definition Mismatches                       4                59
Filler Containing Data                      1                59
Inconsistent Cases                          3                59
Inconsistent Data Types                    15                59
Inconsistent Null Rules                     6                59
Invalid Keys                                1                 3
Invalid Values                              1                59
Miscellaneous                              10                59
Missing Values                             18                59
Orphans                                     2                 2
Out of Range                                3                59
Pattern Exceptions                         10                59
Potential Constants                         1                59
Potential Defaults                          1                59
Potential Duplicates                        3                59
Potential Invalids                          4                59
Potential Redundant Values                 21                59
Potential Unused Fields                     1                59
Rule Exceptions                             3                 3
Unused Fields                               1                59
Total                                     110              1070

Raw Score 89.7%
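
The slide does not state how the raw score is derived. The Python sketch below assumes it is one minus the ratio of discovered issues to possible issues, which reproduces the totals and the 89.7% figure shown; the variable names are illustrative.

```python
# (issue type, issues discovered, possible issues), copied from the chart.
findings = [
    ("Constants", 1, 59),
    ("Definition Mismatches", 4, 59),
    ("Filler Containing Data", 1, 59),
    ("Inconsistent Cases", 3, 59),
    ("Inconsistent Data Types", 15, 59),
    ("Inconsistent Null Rules", 6, 59),
    ("Invalid Keys", 1, 3),
    ("Invalid Values", 1, 59),
    ("Miscellaneous", 10, 59),
    ("Missing Values", 18, 59),
    ("Orphans", 2, 2),
    ("Out of Range", 3, 59),
    ("Pattern Exceptions", 10, 59),
    ("Potential Constants", 1, 59),
    ("Potential Defaults", 1, 59),
    ("Potential Duplicates", 3, 59),
    ("Potential Invalids", 4, 59),
    ("Potential Redundant Values", 21, 59),
    ("Potential Unused Fields", 1, 59),
    ("Rule Exceptions", 3, 3),
    ("Unused Fields", 1, 59),
]

discovered = sum(d for _, d, _ in findings)
possible = sum(p for _, _, p in findings)

# Assumed formula: share of possible issues that were NOT found.
raw_score = 1 - discovered / possible

print(f"{discovered} of {possible} possible issues -> raw score {raw_score:.1%}")
```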

Findings-Chart

Weight Factor   Issue Type                  Issues Discovered   Possible Issues
4               Constants                                   1                59
2               Definition Mismatches                       4                59
3               Filler Containing Data                      1                59
1               Inconsistent Cases                          3                59
2               Inconsistent Data Types                    15                59
3               Inconsistent Null Rules                     6                59
5               Invalid Keys                                1                 3
5               Invalid Values                              1                59
1               Miscellaneous                              10                59
3               Missing Values                             18                59
4               Orphans                                     2                 2
5               Out of Range                                3                59
4               Pattern Exceptions                         10                59
2               Potential Constants                         1                59
2               Potential Defaults                          1                59
1               Potential Duplicates                        3                59
3               Potential Invalids                          4                59
4               Potential Redundant Values                 21                59
3               Potential Unused Fields                     1                59
5               Rule Exceptions                             3                 3
4               Unused Fields                               1                59
                Total                                     110              1070

Weighted Score 76.2%

Findings-Chart

Weight Factor                               5        4        3        2        1
Issues identified in weight factor          8       35       30       21       16
Average rate per factor                35.03%   31.19%   10.17%    8.90%    9.04%
Total Average by weight                175.1%   124.7%    30.5%    17.8%     9.0%

Weighted Issue Rate - 23.8%

Weighted Assessment Score - 76.2%
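
The deck gives the figures but not the formula. One reading that reproduces every number on this slide is: for each weight factor, average the per-issue-type rates (discovered divided by possible), multiply that average by the weight, sum the results, and divide by the sum of the weight factors (5+4+3+2+1 = 15). The Python sketch below implements that assumed reading; the function name is illustrative.

```python
from collections import defaultdict

# (weight factor, issue type, issues discovered, possible issues), from the weighted chart.
findings = [
    (4, "Constants", 1, 59),
    (2, "Definition Mismatches", 4, 59),
    (3, "Filler Containing Data", 1, 59),
    (1, "Inconsistent Cases", 3, 59),
    (2, "Inconsistent Data Types", 15, 59),
    (3, "Inconsistent Null Rules", 6, 59),
    (5, "Invalid Keys", 1, 3),
    (5, "Invalid Values", 1, 59),
    (1, "Miscellaneous", 10, 59),
    (3, "Missing Values", 18, 59),
    (4, "Orphans", 2, 2),
    (5, "Out of Range", 3, 59),
    (4, "Pattern Exceptions", 10, 59),
    (2, "Potential Constants", 1, 59),
    (2, "Potential Defaults", 1, 59),
    (1, "Potential Duplicates", 3, 59),
    (3, "Potential Invalids", 4, 59),
    (4, "Potential Redundant Values", 21, 59),
    (3, "Potential Unused Fields", 1, 59),
    (5, "Rule Exceptions", 3, 3),
    (4, "Unused Fields", 1, 59),
]


def weighted_issue_rate(findings):
    """Weighted issue rate under the assumed formula described above."""
    rates_by_weight = defaultdict(list)
    for weight, _issue, discovered, possible in findings:
        rates_by_weight[weight].append(discovered / possible)
    # Average rate per factor, scaled by the factor itself.
    weighted_total = sum(
        weight * sum(rates) / len(rates) for weight, rates in rates_by_weight.items()
    )
    return weighted_total / sum(rates_by_weight.keys())


issue_rate = weighted_issue_rate(findings)
assessment_score = 1 - issue_rate

print(f"Weighted Issue Rate {issue_rate:.1%}")              # 23.8%
print(f"Weighted Assessment Score {assessment_score:.1%}")  # 76.2%
```

Under this reading, a few heavily weighted categories with high issue rates (Rule Exceptions at 3 of 3, Orphans at 2 of 2) pull the weighted score well below the 89.7% raw score.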

Address the Issues

• Addressing your findings
  • Actual vs. Potential
  • Subject Matter Expertise
  • Cleansing Requirements

Maintain Vigilance

• Maintain
  • Complete the cycle
  • Periodic review
  • Document score changes

Why Do The Assessment?

• Quantify the quality issues

• Isolate true problems

• Proactive review
  – reduces the cost of resolving issues
  – reduces the risk of customer dissatisfaction

• Define the scope of issues

• Determine the resources required to address issues

Why Do The Assessment?

[Chart: Project Costs. Vertical axis: Cost to Address an Issue. Horizontal axis: Project Timeline (When you find an Issue)]

Why should it be done

TIME

Pay me now or Pay me later

When Should It Be Done?

• Every IT data project
  – Warehousing
  – CRM
  – ERP
  – EAI
  – M&A
• Ongoing based on
  – Criticality of the system
  – Current status (score)
  – Need to re-purpose data

Bibliography

Larry P. English, Improving Data Warehouse and Business Information Quality, John Wiley & Sons Inc., 1999

Jack Olson, Data Profiling: The Accuracy Dimension, Morgan Kaufmann, 2002

Thomas C. Redman, Data Quality for the Information Age, Artech House, 1996

PricewaterhouseCoopers, “Global Data Management Survey”, 2001