Information Quality in Context

Yang W. Lee
[email protected], [email protected]
Northeastern University
Phone: 1-617-373-5052
Fax: 1-617-373-3166

Joint UC Berkeley – MIT Winter 2003 Data Quality Workshop
February 18-19, 2003

Outline

Introduction
  Examples: What is Data Quality?
  Background: Motivation and Related Work
Research Questions
Concepts
Study: Sites, Projects, Data, Analysis
Results: 3 Data Quality (DQ) Problem Patterns
DQ Improvement: 10 Potholes (Root Causes)
Summary and Lessons Learned

Source: web.mit.edu/tdqm/www/winter/L3Winter03.pdf

Introduction: Examples

Rosetta Stone: found in 1799, inscription deciphered and published in 1822
The overture of 1805
FedEx in 2002

The 1805 Overture

In 1805, the Austrian and Russian Emperors agreed to join forces against Napoleon. The Russians said their forces would be in the field in Bavaria by Oct. 20. The Austrian staff planned based on that date in the Gregorian calendar. Russia, however, used the ancient Julian calendar, which lagged 10 days behind. The difference allowed Napoleon to surround Austrian General Mack's army at Ulm on Oct. 21, well before the Russian forces arrived.

Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, p. 390.

Acknowledgement: A. Morton and S. Madnick

FedEx 2002

111502
010203
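The slide gives only the two six-digit strings. Assuming they are date stamps (an assumption, not something the slide states), the sketch below lists every calendar date each string could denote under three common layouts. The layout names and the `possible_readings` helper are hypothetical and chosen for illustration.

```python
from datetime import datetime

# Hypothetical candidate layouts; the slide does not say which one applies.
FORMATS = {"MMDDYY": "%m%d%y", "DDMMYY": "%d%m%y", "YYMMDD": "%y%m%d"}


def possible_readings(stamp):
    """Return {layout name: date} for every layout under which stamp is a valid date."""
    readings = {}
    for name, fmt in FORMATS.items():
        try:
            parsed = datetime.strptime(stamp, fmt)
        except ValueError:
            continue  # not a valid date under this layout
        # Round-trip check: keep only parses that reproduce the stamp exactly.
        if parsed.strftime(fmt) == stamp:
            readings[name] = parsed.date()
    return readings


for stamp in ("111502", "010203"):
    print(stamp, possible_readings(stamp))
```

Under these assumptions the first string has a single valid reading, while the second is valid under all three layouts and denotes three different dates, so its meaning depends entirely on the reader's context.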

Background and Motivation

Documented risk, cost, impact of poor-quality data
Undocumented lost opportunities
Everyday inconvenience

A global consumer product company wants to identify products made of the same materials for its global procurement plan.

A major hospital faces difficulties in conducting cross-patient trend analysis for its proactive patient care program.

An insurance company faces the dilemma of using its poor-quality marketing analysis results for making strategic business decisions.

Cumulative impact of poor DQ on organizational performance

Consumer dissatisfaction, unstable business operations, misguided business strategies, and missed business opportunities.

Related Work and Work-in-progress

Information Manufacturing Model (Ballou et al, 1998)
Data Quality Dimensions (Wang and Strong, 1996)
Data Quality in Context (Strong, Lee, Wang, 1997)
Data Quality Measurement (Pipino, Lee, and Wang, 2002)
Data Quality Assessment (Lee, Pipino, and Wang, 2002)
Information Product (Wang, Lee et al, 1998)
Information Product-MAP (Pierce et al, 2002)
Quality Information and Knowledge (Huang, Lee, and Wang, 1999)
Interdependencies: Data and Process (Lee and Katz-Haas, 2002)
Process-embedded Data Integrity (Lee et al)
Knowledge at Work for Data Quality (Lee et al)
Rules in Data Quality (Lee et al)
Context-reflective DQ Problem-solving (Lee et al)
Journey to Data Quality (Lee et al, MIT Press, Forthcoming)

Research Questions

How do organizations define data quality?
What data quality problems arise in organizations?
How do organizations identify, analyze, and resolve data quality problems?
Are there common data quality patterns?
  Across organizations
  Across DQ projects

Concepts

Data Production System

Data Consumer’s View

Multiple Data Quality Categories

Data Production System

Data Consumer’s View of DQ

Quality data is data that is fit for use by data consumers (Wang et al, 1996)

IQ Category          IQ Dimensions
Intrinsic IQ         Accuracy, Objectivity, Believability, Reputation
Contextual IQ        Relevancy, Value-Added, Timeliness, Completeness, Amount of information
Representational IQ  Interpretability, Ease of understanding, Concise representation, Consistent representation
Accessibility IQ     Accessibility, Access security
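The category-to-dimension table can be held as a small lookup structure, which is convenient when dimensions are later used as codes. This is a minimal sketch: the dictionary contents come from the table, while the names `IQ_CATEGORIES` and `category_of` are hypothetical.

```python
# Categories and dimensions as listed in the table above.
IQ_CATEGORIES = {
    "Intrinsic IQ": ["Accuracy", "Objectivity", "Believability", "Reputation"],
    "Contextual IQ": ["Relevancy", "Value-Added", "Timeliness",
                      "Completeness", "Amount of information"],
    "Representational IQ": ["Interpretability", "Ease of understanding",
                            "Concise representation", "Consistent representation"],
    "Accessibility IQ": ["Accessibility", "Access security"],
}


def category_of(dimension):
    """Return the IQ category containing the dimension, or None if it is not listed."""
    wanted = dimension.strip().lower()
    for category, dimensions in IQ_CATEGORIES.items():
        if wanted in (d.lower() for d in dimensions):
            return category
    return None


print(category_of("Timeliness"))       # Contextual IQ
print(category_of("Access security"))  # Accessibility IQ
```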

Multiple DQ Categories

Intrinsic DQ: information has quality in its own right.

Contextual DQ: information quality must be considered within the context of the task at hand.

Representational DQ and Accessibility DQ emphasize the importance of the role of systems.

Study Concepts

DQ Project
  Data-related actions taken to manage DQ problems
  Problem finding (inquiry)
  Problem analysis (framing)
  Problem resolution (action)

DQ Problem
  Any difficulty in collecting, storing/maintaining, and utilizing data.

DQ Stakeholders
  Data collectors
  Data custodians (IS professionals)
  Data consumers

Study Sites

3 data-intensive and service-critical companies
  Airline (GoldenAir)
  Hospital (BetterCare)
  HMO (HyCare)

Seriously attend to their IQ problems
Vary in their computing environment
Vary in how they attend to IQ
  Software tools
  IQA/DQA
  TQM

Data Collection

Collected 42 DQ project histories
  DQ Project: data-related actions taken in an organization to manage DQ problems
  DQ Problem: difficulties in collecting, storing, or using data

Interviewed information stakeholders:
  Information collectors
  Information custodians (IS professionals)
  Information consumers
  Managers for information collectors, custodians, and consumers

Data Analysis

Performed content analysis of 42 transcribed DQ project histories
  DQ dimensions are the content analysis codes

Performed pattern analysis of coded projects
  Classified projects: by overriding DQ concern into four DQ categories
  Within project: chronological order of DQ dimensions
  Across projects: group by common patterns of chronological DQ dimensions

Performed embedded case analysis of each project
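The across-project step of the pattern analysis can be sketched as grouping projects by their chronological sequence of DQ dimension codes. This is a simplified illustration, not the study's actual procedure: it assumes exact sequence matches, and the project identifiers and coded sequences below are invented.

```python
from collections import defaultdict


def group_by_sequence(coded_projects):
    """Group project ids by their chronological sequence of DQ dimension codes."""
    groups = defaultdict(list)
    for project_id, dimensions in coded_projects.items():
        # A tuple is hashable, so the ordered sequence can serve as the group key.
        groups[tuple(dimensions)].append(project_id)
    return dict(groups)


# Hypothetical coded histories: each list is ordered by when the dimension surfaced.
coded = {
    "P01": ["Believability", "Reputation", "Value-Added"],
    "P02": ["Believability", "Reputation", "Value-Added"],
    "P03": ["Accessibility", "Timeliness"],
}

for sequence, projects in group_by_sequence(coded).items():
    print(" -> ".join(sequence), projects)
```

Exact matching is the strictest possible grouping rule. A real analysis would likely also need a way to treat near-identical sequences as one pattern.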

Example DQ Project: Hospital

PROBLEM FINDING:
  TRACE DQA noticed a large increase in infectious disease patients

PROBLEM ANALYSIS:
  A possible error in collection and storage of data
  Called admissions to confirm this cause

PROBLEM RESOLUTION:
  Process: Trained personnel; checked and revised emergency room procedures
  Data: Admissions and IS work together to change data
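A project history of this kind can be recorded with one field per phase. The sketch below is only one possible representation: the `DQProject` class and its `missing_phases` helper are hypothetical, and the entries are taken from the hospital example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DQProject:
    """One DQ project history, split into the three phases named on the slides."""
    site: str
    problem_finding: List[str] = field(default_factory=list)
    problem_analysis: List[str] = field(default_factory=list)
    problem_resolution: List[str] = field(default_factory=list)

    def missing_phases(self) -> List[str]:
        """Names of phases with no recorded actions (hypothetical helper)."""
        phases = {
            "problem_finding": self.problem_finding,
            "problem_analysis": self.problem_analysis,
            "problem_resolution": self.problem_resolution,
        }
        return [name for name, actions in phases.items() if not actions]


hospital = DQProject(
    site="Hospital",
    problem_finding=["TRACE DQA noticed a large increase in infectious disease patients"],
    problem_analysis=["A possible error in collection and storage of data",
                      "Called admissions to confirm this cause"],
    problem_resolution=["Process: trained personnel; revised emergency room procedures",
                        "Data: admissions and IS work together to change data"],
)
print(hospital.missing_phases())  # [] because all three phases are recorded
```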

Results
Three DQ Patterns Identified

Intrinsic DQ pattern
  Information not used by consumers
  believability, reputation, objectivity

Accessibility DQ pattern
  Consumers experience any barriers to accessing information
  accessibility, access security, timeliness
  Representational DQ dimensions show up as underlying causes of accessibility DQ: consistent, concise representation

Contextual DQ pattern
  Consumer’s (multiple) task (changing) context as critical context

Pattern 1: Intrinsic DQ

Mismatch between several sources of the “same” data
  Hospital: TRACE vs. STATUS (e.g., daily hospital bed utilization); “consistency” vs. “accuracy”
  Airline: manual vs. warehouse, MMS vs. warehouse

Starts as a believability issue
Over time, poor reputation of sources
  STATUS develops poor reputation for quality
  MMS develops poor reputation for quality

Subjective production of data
  Human judgment in coding

DQ Pattern 1: Intrinsic DQ (figure)

(1) Multiple sources of same data -> Questionable Believability -> Poor Reputation -> Little Added Value -> Data not used
(2) Judgement involved in data production -> Questionable Objectivity -> Poor Reputation -> Little Added Value -> Data not used

Figure annotations:
  Mismatches exist
  Information about causes of mismatches accumulates
  Data production process viewed as subjective
  Information about subjectivity accumulates
  Poor intrinsic data quality becomes common knowledge
  Data not used because of little added value and poor reputation
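The starting point of Pattern 1, a mismatch between two sources of the "same" data, can be detected with a simple comparison. In this sketch the function name and the bed-utilization figures are invented for illustration. Only the source names TRACE and STATUS come from the slides.

```python
def find_mismatches(source_a, source_b, tolerance=0):
    """Return {key: (a_value, b_value)} where the two sources disagree beyond tolerance."""
    shared_keys = sorted(source_a.keys() & source_b.keys())
    return {
        key: (source_a[key], source_b[key])
        for key in shared_keys
        if abs(source_a[key] - source_b[key]) > tolerance
    }


# Hypothetical daily bed-utilization counts; the values are invented.
trace = {"2003-02-17": 412, "2003-02-18": 405, "2003-02-19": 398}
status = {"2003-02-17": 412, "2003-02-18": 391, "2003-02-19": 398}

print(find_mismatches(trace, status))
```

A check like this only shows that the sources disagree. It cannot say which one is accurate, which is why the pattern begins as a believability issue for consumers.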

Pattern 2: Accessibility DQ

Technical Accessibility
  Physical access (Airline)
  Computing resources (HMO)

Time to Access / Ease of Access
  Amount of data (HMO)
  Privacy, confidentiality (HMO, Hospital)

Interpretability and Understandability
  Coding, such as DRG coding (HMO, Hospital)

Representation and its Analyzability
  Image and text data (HMO, Hospital)

DQ Pattern 2: Accessibility DQ (figure)

Root causes, paths (3) to (7): Lack of computing resources; Privacy and confidentiality; Computerizing and data analyzing
DQ dimensions affected: Poor Accessibility; Access Security; Interpretability and Understandability; Concise and Consistent Representation; Amount of Data; Timeliness
Outcome: Barriers to data accessibility

Figure annotations:
  Systems difficult to access; e.g., unreliable network
  Computerized data inaccessible due to insufficient systems resources
  Must protect confidentiality
  Computerized data inaccessible due to time and effort to get authorized permission to access
  Technical data across multiple specialties included in databases; e.g., medical terminology, medical measurements, and engineering specifications
  Computerized data coded, e.g., DRG and procedure codes
  Computerized data inaccessible because multiple specialists are needed to interpret data across multiple specialties
  Advanced IT permits storage of image and text data
  Computerized data inaccessible for analysis due to limited capabilities to summarize across image and text data
  Large amount of data accumulated
  Processing slowed due to large data volume; e.g., weekend batch extracts
  Computerized data inaccessible when needed

Pattern 3: Contextual DQ

Mis-match between information available and what information is relevant and adds value for information consumers

Missing data -- the easy case
Data bundling and analyzability -- the hard case

Consider the hard case

Pattern 3: Contextual DQ

Data bundling and analyzability

Issue is aggregation
  Across record (transaction) analysis of data
  Often across distributed systems

Incompatible, distributed systems (HMO)

Bundling Unit (Hospital)
  1970’s: procedures performed in the hospital
  1980’s: patient visit, disease
  1990’s: patient across all visits, diseases
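The shift in bundling unit amounts to aggregating the same records on a different key. The following sketch uses invented records and a hypothetical `charge` field to show how one data set yields different totals per procedure, per visit, and per patient.

```python
from collections import defaultdict

# Hypothetical transaction records; all identifiers and amounts are invented.
RECORDS = [
    {"patient": "A", "visit": "A-1", "procedure": "x-ray", "charge": 100},
    {"patient": "A", "visit": "A-1", "procedure": "cast", "charge": 250},
    {"patient": "A", "visit": "A-2", "procedure": "x-ray", "charge": 100},
    {"patient": "B", "visit": "B-1", "procedure": "x-ray", "charge": 100},
]


def bundle(records, unit):
    """Sum charges per bundling unit, where unit is a field name present in each record."""
    totals = defaultdict(int)
    for record in records:
        totals[record[unit]] += record["charge"]
    return dict(totals)


print(bundle(RECORDS, "procedure"))  # procedure as the unit
print(bundle(RECORDS, "visit"))      # patient visit as the unit
print(bundle(RECORDS, "patient"))    # patient across all visits as the unit
```

The aggregation works only because every record carries all three keys. When a newer bundling unit relies on a field that was never collected, or that is held in a separate incompatible system, the same call cannot be made, which is the hard case the slide describes.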

DQ Pattern 3: Contextual DQ (figure)

Root causes, paths (8) to (10): Operational data production problems; Changing data consumers' needs; Distributed Computing
DQ dimensions affected: Incomplete Data; Poor Relevancy; Inconsistent Representation; Little Value Added
Outcome: Data utilization difficulty

Figure annotations:
  Data producers fail to supply complete data
  Need for new data
  Need to aggregate data based on "fields" (attributes) that do not exist in the data
  Need to aggregate, report and integrate across autonomous and heterogeneous systems
  Computerized data are not relevant to current data consumers' tasks due to incomplete data for analysis and aggregation
  Integrated data from different systems add little value due to inconsistently represented data
  Inability to integrate or aggregate data results in poor contextual DQ (data with little value-added or relevancy to data consumers' tasks)

Organizational DQ Principles

Intrinsic DQ:
  Information has quality in its own right (Internal View)

Accessibility DQ:
  Information must be accessible, but secure
  Information must be presented in a concise, but understandable representation.

Contextual DQ:
  DQ must be considered within the context of the task at hand

Data Quality Problem Pattern (summary figure combining the three patterns)

Root causes, paths (1) to (10): Multiple sources of same data; Judgement involved in data production; Lack of computing resources; Privacy and confidentiality; Computerizing and data analyzing; Operational data production problems; Changing data consumers' needs; Distributed Computing
DQ dimensions affected: Questionable Believability; Questionable Objectivity; Poor Reputation; Little Added Value; Poor Accessibility; Access Security; Interpretability and Understandability; Concise and Consistent Representation; Amount of Data; Timeliness; Incomplete Data; Poor Relevancy; Inconsistent Representation; Little Value Added
Outcomes: Data not used; Barriers to data accessibility; Data utilization difficulty

The Road to Data Quality

Improving Data Quality

Attend to Data Production Processes
  Data collection
  Data storage
  Data utilization

Attend to Key DQ Problems

The Information Production Road

The Ten Potholes

1. Subjective information production
2. Multiple sources of the same information
3. Information production errors
4. Too much information
5. Distributed, inconsistent information
6. Storage of non-numeric information
7. Lack of algorithms for non-numeric information
8. Changing task environment of information consumers
9. Security and privacy vs. accessibility
10. Lack of computing resources

Subjective Judgment

Multiple Sources of Same Data

Systemic Errors in Data Production

Large Volume vs. Timely Access

Distributed Heterogeneous Systems

Advanced Analysis: Image and Text

Nonnumeric Data

Environment/Market Change

Access vs. Security and Privacy

Lack of Computing Resources

Ten Potholes in the Road to Information Quality (figure)

Stakeholders: Information Producers; Information Custodians; Information Consumers
Processes: Information Production Process; Information Storage & Maintenance Process; Information Utilization Process
Other elements: Information Sources; Information Systems Infrastructure; Computerized Database; Task Environment
Potholes marked P1 to P10: Multiple Sources; Subjective Production; Production Errors; Too Much Information; Non-numeric Information; Distributed Systems; Advanced Analysis Requirements; Changing Task Needs; Security & Privacy Requirements; Lack of Computing Resources

The Information Collection Road

Key IQ Problems
  Multiple Sources of the Same Information (duplicate production)
  Subjective Information Production
  Information Production Errors

The Information Storage Road

Key IQ Problems
  Too much information
  Distributed, inconsistent information
  Storage of non-numeric information

The Information Utilization Road

Key IQ Problems
  Lack of algorithms for non-numeric information
  Changing task environment of information consumers
  Security and Privacy vs. Accessibility
  Lack of Computing Resources
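Taken together, the collection, storage, and utilization roads assign each of the ten potholes to one stage of the information production system. The mapping can be written down and checked for completeness as below. The pothole numbers follow the Ten Potholes list, and the names `POTHOLES`, `ROADS`, and `road_of` are hypothetical.

```python
# Pothole numbers and names as given in the Ten Potholes list.
POTHOLES = {
    1: "Subjective information production",
    2: "Multiple sources of the same information",
    3: "Information production errors",
    4: "Too much information",
    5: "Distributed, inconsistent information",
    6: "Storage of non-numeric information",
    7: "Lack of algorithms for non-numeric information",
    8: "Changing task environment of information consumers",
    9: "Security and privacy vs. accessibility",
    10: "Lack of computing resources",
}

# Stage assignment as listed on the three road slides.
ROADS = {
    "collection": [2, 1, 3],
    "storage": [4, 5, 6],
    "utilization": [7, 8, 9, 10],
}


def road_of(pothole_number):
    """Return the stage whose road slide lists the given pothole."""
    for road, numbers in ROADS.items():
        if pothole_number in numbers:
            return road
    raise KeyError(pothole_number)


for number, name in POTHOLES.items():
    print(f"{number:2d}. {name}: {road_of(number)}")
```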

Lessons Learned

Accuracy is necessary, but not sufficient for high DQ.
Attend to evolving DQ problems: DQ problems change as business needs change over time (global, cross-functional, integration).
Attend to the entire Information Production System.
Attend to the root causes of key common IQ problems.
Look beyond technical accessibility.
Recognize that DQ is evaluated in the context of the changing tasks of multiple data consumers.

Key References

Lee, Y., D. Strong, B. Kahn, and R. Wang, “AIMQ: A Methodology for Information Quality Assessment,” Information & Management, Vol. 40, Issue 2, December 2002, pp. 133-146.
Huang, K. T., Y. Lee, and R. Wang, Quality Information and Knowledge, Upper Saddle River, NJ: Prentice Hall, 1999.
Strong, D., Y. Lee, and R. Wang, “Data Quality in Context,” Communications of the ACM, May 1997, pp. 103-110.
http://web.mit.edu/tdqm