Information Systems Frontiers 1:3, 279-301 (1999). © 2000 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Enterprise Data Quality: A Pragmatic Approach
Amjad Umar, George Karabatis, Linda Ness and Bruce Horowitz
Telcordia Technologies, 445 South Street, Morristown, NJ 07960-6439
E-mail: [email protected]
Ahmed Elmagarmid*
Department of Computer Sciences, Purdue University,
West Lafayette, IN 47907
Abstract. Enterprise data, the data that is created, used and shared by a corporation in conducting business, is a critical business resource that must be analyzed, architected and managed with data quality as a guiding principle. This paper presents results, practical insights, and lessons learned from a large scale study conducted in the telecommunications industry that synthesizes data quality issues into an architectural and management approach. We describe the real life case study and show how requirements for data quality were collected, how the data quality metrics were defined, what guidelines were established for intersystem data flows, what COTS (commercial off-the-shelf) technologies were used, and what results were obtained through a prototype effort. As a result of experience gained and lessons learned, we propose a comprehensive data quality approach that combines data quality and data architectures into a single framework with a series of steps, procedures, checklists, and tools. Our approach takes into account the technology, process, and people issues and extends the extant literature on data quality.
Key Words. data quality, data quality metrics, data quality methodology, process flowthrough, distributed systems
1. Introduction
Data quality has emerged as a major issue recently due to its potentially severe impact on organizational effectiveness. For example, a leading computer industry information service indicated that it expects most business process reengineering initiatives to fail due to a lack of data quality (Wand, 1996). In addition, Redman (1998) shows how poor data quality results in operational impacts (e.g., lowered customer satisfaction, increased cost, employee dissatisfaction), tactical impacts (e.g., poorer decision making, more difficulties in building data warehouses, more difficulties in reengineering, increased organizational mistrust), and strategic impacts (difficulties in setting and executing strategies, diversion of management attention). In some cases, in particular financial statements, the data must be certifiably free of certain types of errors (Kaplan, 1998). As other examples, wrong price data in retail databases may cost American industry as much as $2.5 billion in overcharges annually (English, 1996), and dirty data is causing major problems with data warehouses (i.e., many users are retrieving wrong information from their data warehouses) (Cellco, 1995).
Although data quality is gaining some recognition,
views about data quality and approaches to improve
data quality differ widely (English, 1996; Wang,
1998; Orr, 1998; Redman, 1992). The fundamental
problem with the extant concepts, methodologies, and
case studies is that a linkage is not made with the
existing day-to-day activities. For example, state of
the practice in data architectures typically does not
include data quality issues, and thus most data quality considerations are left out of the regular day-to-day corporate practices. There is a need to extend the data architecture work by explicitly including the following two data quality considerations:

* How to measure data quality (i.e., metrics).
* How to improve data quality.
*Participated in project while at Telcordia Technologies.
This paper presents an overall strategy that
synthesizes data quality issues into an architectural
vision that is supported by a series of steps,
procedures, checklists, and tools. This strategy
attempts to bridge multiple areas by first defining key concepts and outlining a framework for discussion (Section 2). In Section 3, we describe an extensive data quality case study that consists of several interrelated activities: (a) problem analysis that involved a series of workshops with project collaborators from participating companies and identification/prioritization of key problems; (b) design for the prevention and amelioration of poor data quality and inconsistency based on data flowthrough and data inconsistency/data quality metrics; (c) assessment of COTS (commercial off-the-shelf) and near-COTS technology that ranged from data cleanup tools to workflows and multidatabase transactions/multisystem integration technologies; and (d) demonstration of the impact of selected technology on real-life large scale data quality problems. Section 4 proposes, based on our experience gained and lessons learned, an extensive
approach that combines data quality and data
architectures into a single conceptual framework. In
particular, our experience has shown that the proposed
strategy is quite effective because it takes into account
the technology, process, and people issues and can be
used to design data quality as an integral part of
enterprise data. The focus of this paper is on
pragmatic issues and practical insights gained through
involvement in large scale data quality problems.
2. Data Quality Concepts and Approaches
There are different views and definitions of data quality (see, for example, English, 1996; Wand, 1996; Moriarity, 1996). We will use the following operational definition of data quality (Redman, 1992):
A product, service, or datum X is of higher quality
than product, service, or datum Y if X meets
customer needs better than Y.
This definition appears to be generally accepted (i.e., quality must be defined in terms of customer satisfaction) and generally conforms to other definitions. For example, Tayi and Ballou's (1998) statement that ``data quality can best be defined as fitness for use'' resonates with this definition. However, it needs further refinement and elaboration. For example, Wand and Wang (1996) identify 26 dimensions of data quality, Wang (1998) defines quality in terms of the information product (i.e., it reflects the quality of the characteristics and processes that manufacture the information product), Orr (1998) contends that the use of data must be increased to improve data quality, and Larry English (1996) suggests that data quality should be considered in terms of data definition, data content, and data presentation.
We focus on the following quality attributes of data, initially proposed by Ballou and Pazer (1985), since they subsume other views:
* Accuracy: reflects correctness with respect to real life.
* Consistency: two or more data items do not conflict with each other.
* Currency: how recent is the information.
* Completeness: degree to which values are present in a data collection.
These attributes are related to the data content and provide the core data quality metrics; a minimal sketch of how they can be computed appears after the next list. Additional important data quality attributes are:

* Definition: reflects that data must be clearly and unambiguously defined.
* Access: shows the ease with which the users can access the data.
* Presentation: reflects the style with which the data is presented.
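To make the content attributes concrete, here is a minimal sketch, assuming two toy location records with invented field names, of how completeness, currency, and cross-system consistency could be scored; it is illustrative only and is not taken from the study.

```python
from datetime import datetime, timezone

# Hypothetical records from two systems that should agree on the same entity.
system_a = {"clli": "MRTNNJ01", "street": "445 South Street", "city": "Morristown",
            "zip": "07960", "updated": datetime(1999, 5, 1, tzinfo=timezone.utc)}
system_b = {"clli": "MRTNNJ01", "street": "445 South St", "city": "Morristown",
            "zip": None, "updated": datetime(1998, 11, 3, tzinfo=timezone.utc)}

def completeness(record):
    """Fraction of fields that are actually populated."""
    values = list(record.values())
    return sum(v not in (None, "") for v in values) / len(values)

def currency_days(record, as_of):
    """Age of the record in days; smaller means more current."""
    return (as_of - record["updated"]).days

def consistency(rec_a, rec_b, fields):
    """Fraction of shared fields whose values agree exactly across the two systems."""
    return sum(rec_a[f] == rec_b[f] for f in fields) / len(fields)

now = datetime(1999, 6, 1, tzinfo=timezone.utc)
print("completeness A:", completeness(system_a))
print("currency A (days):", currency_days(system_a, now))
print("consistency A vs B:", consistency(system_a, system_b, ["clli", "street", "city"]))
```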
Data quality needs to be improved for different reasons:

* Economic reasons (i.e., how can I reduce the cost of service).
* Service quality reasons (i.e., how happy are my users).
* Process efficiency (i.e., maximizing throughput).
* Platform efficiency (i.e., maximizing investment in hardware/software).
To improve the quality of data, several assessment and
improvement activities need to take place throughout
its life cycle. The approaches to obtain data quality
fall into the following broad categories:
* Data cleanup.
* Process cleanup.
Data cleanup involves use of a tool to identify ``bad
data'' (i.e., not accurate/consistent/current/complete)
and then elimination of bad data through automated
and/or manual processes. A wide variety of ``data
scrubbers'' are commercially available to perform this
task. The main limitation of this approach is that some data cannot be easily verified to be correct. For example, no edit can verify someone's home address. In addition, data cleanup needs to be a periodic effort that must be repeated several times over the life cycle of the data.
Process cleanup goes beyond data cleanup and
concentrates on the activities that tend to pollute clean
data. The main activities involved in process cleanup
are:
* Establish quality metrics.
* Monitor the data life cycle for quality pollution.
* Use statistical quality control and process management to maintain desired data quality.
For total quality management, both approaches (data
cleanup and process cleanup) are needed.
Large and complex systems, as shown in Fig. 1,
involve many components (e.g., the data, the
application software that operates on the data, the
underlying platforms on which the data resides and
that are used to access the data, and the overall process
for using and managing the system). Different players
(the users of the system, the managers, the business
customers, the developers, etc.) view these compo-
nents at different levels of detail. Conceptually,
different data quality views can be expressed in
terms of each component to indicate its behavior and
to improve its quality. Examples of the views, from a
data quality perspective, are:
* Quality of the data itself (e.g., accuracy, currency, consistency, completeness).
* Quality of the application software that operates on the data (e.g., software bugs).
* Quality of the platform (e.g., the performance of the access mechanisms).
* Quality of the management and operational processes (e.g., manual interventions, errors, delays, flowthroughs, user satisfaction level).
Fig. 1 shows the main components and the key data quality metrics associated with each component. The model categorizes these components in terms of processes, people and technologies because each of these categories needs to be measured and improved. We also show conceptual inter-relationships between different levels of metrics. We use this model for our view of data quality. This view encompasses several existing views such as English (1996), Moriarity (1996), Orr (1998), Redman (1992), Redman (1998), Wand (1996) and Wang (1998).
3. Case Study: Data Quality in Telecommunications
This section describes a case study of how a family of data quality problems in the telecommunications industry was addressed through a systematic approach. The case study is based on a project that was funded by the Regional Bell Operating Companies (RBOCs) to address the issues of data reconciliation specifically and data quality at large. Fig. 2 presents the project activities, discussed below, as a workflow diagram.
3.1. Problem identification workshops
As a starting point, we conducted a series of one-day
workshops with project collaborators from partici-
pating companies to identify/prioritize key problems.
The workshops were attended by a dozen to three dozen
staff members that included data architects, database
administrators, managers of various customer service
units, and technology consultants. Each workshop
started with an overview of the data quality issues to
introduce key concepts and to provide a common
framework for discussion. The overview was fol-
lowed by an open brainstorming session on the data
quality problems encountered by the sponsors, a
prioritization/classification session to focus on the key
problems, and a detailed discussion session to quickly
analyze the high priority items.
In all, more than 80 issues were raised by the
sponsors at these workshops. In order to obtain a
meaningful view of the issues, they were categorized
into 15 broad areas that were ranked by the customers
in terms of business need (see Table 1). The overall priority ranking of a category was proportional to the number of issues in that category, summed over all
sessions. For example, data quality metrics were of
Fig. 1. Conceptual model for metrics.
Fig. 2. Data quality (DQ) project plan.
Table 1. Analysis of data quality problems
Broad issues Priority Category of data quality problem
*Inconsistency among systems H Data
*Process improvement and mechanization H Process
*Need for metrics H Process
*Flow through between systems H Software
*Root causes and data redundancy H Data
*System architecture and system evolution M Software
Standardization M Process
Inconsistency with reality M Data
Data input/validation M Data
Data access and security M Software
Single primary source of data M Data
Communications/administrative complexity L Process
Ownership and accountability L Process
Methodology L Process
Recovery from data inconsistencies L Software
major concern to multiple participants in all work-
shops we conducted and were thus ranked very high.
After the workshops, we also attempted to cast the
15 broad issues into various data quality problem
categories such as data itself, the application software
that operates on the data, the underlying platforms on
which the data resides and that are used to access the
data, and the overall process for using and managing
the system (see Table 1, column 3). During this
categorization, we noted that many high priority
problems were not related to the quality of the data
itself but were instead concerned with the process and
platform issues. This was a surprise to us. In addition,
many data quality problems discussed in these workshops were directly or indirectly related to location data (i.e., location of customers, telecom equipment, and work centers). This is because location information plays a central role in telecom systems.
Let us now briefly review the broad categories of issues that were raised in these workshops.
Inconsistency among systems. Inconsistent infor-
mation between different systems causes serious
problems. The same information does not exist in all
required systems, and even when it does, it is
represented differently in different systems. A better
way is needed to keep replicated data synchronized.
The corporate data solution of some clients was examined, and it was found that different kinds of addresses from different systems represent the same entity. This causes business problems (in some cases 25% of the customers do not receive their bill due to a bad address). There is also a low degree of automation in the process. This, coupled with a high
degree of repetitive data entry to multiple systems,
exposes the process to ubiquitous data reconciliation
problems. In particular, it is likely that data entry
errors will occur, causing data discrepancies between
the systems. It is also likely that some or all updates
will not be performed to a particular system, resulting
in a system containing incomplete data. The following example illustrates this situation. When a central office (CO) is sold, if that information is not entered promptly, then an automated inventory system will send equipment to the new owner but the old client pays. In addition, the old client pays tax on the equipment, as it is still carried in their database. A single interface to all systems is needed to access location information.
Process improvement and mechanization. The
processes need to be improved and automated for
lasting effects. The common scenario is: a manager
comes in, fixes the databases, gets promoted, leaves,
and the problems resurface. The issue is really one of
integrating disparate processes. The customer data is
critical because it is important to quantify return on
any customer projects. The main issues discussed in
this area were:
* Our clients need to manage issues between regulated and unregulated sides of the market.
* There is a need to mine information in customer databases that will allow our client to offer new products to new customer segments (and to identify those segments).
* Some clients need a unified customer number. There is a need to know if someone is a customer and how to identify them. Name and address present several problems. Some databases use telephone numbers, while others use circuit IDs. There is a need to be able to find all telephone numbers and circuit IDs that are associated with a customer.
Need for metrics. Metrics are needed to drive
behavior to justify investment and detect self-funding
opportunities. Monitoring of metrics is an important
issue in data reconciliation. For example, one should
be able to estimate the financial impact due to data
reconciliation. In particular, the following issues were
raised:
* How to measure the impact of data quality problems?
* How to measure improvement in data quality?
Metrics must include data reconciliation cost
savings (e.g., labor costs, labor/error resolution,
overtime labor/retry error resolution), performance
(e.g., volume work orders, errors, backlog, average
time/error), and operations (volume of orders per
center, volume of errors per center, volume of backlog
per center, average time of handling an error vs.
flowthrough).
Flowthrough between systems. Current flow design is a cause of data inconsistencies. One suggested
solution was reversion to batch processing. Some
clients noted that batch processing is inconsistent with
their need to quickly propagate customer data.
Root causes and data redundancy. Software evolution can cause data quality to decrease, since there are no procedures for making old data consistent with new integrity rules. One root cause of data inconsistencies among provisioning and operations databases is the fact that the processes were first developed as isolated islands and were gradually pieced together. Some systems have many duplicated databases. This is a cause of significant data inconsistency.
System architecture and system evolution. Re-
dundant data architecture is a root cause of data
inconsistencies. A well designed architecture is
needed to eliminate data silos (i.e., ``smokestack
systems''). Many applications and projects do
local optimization, subverting global optimization.
Some of the reasons are:
* Unreliable synchronization technology.
* Time pressures (get things up and running quickly).
* Lack of cross-application responsibility, data and applications knowledge, and overall data architecture.

There is an economic justification for a better architecture. Databases proliferate because people can't obtain access to needed data; they create a database and the project manager controls it. Architectures are very closely related to flowthroughs and flow design.
Standardization. Uniform valid point of entry is
needed. For example, some systems accept non-
standard format. In addition, a common format is
essential. In some cases, address information is stored in more than 20 different data files; the addresses are of various types (e.g., postal, client specific, etc.) and obey various rules. A uniform valid point of entry with a single owner is needed.
Inconsistency with reality. Many information items
are inconsistent with reality. For example, it is difficult to figure out which equipment belongs to which user.
Data input/validation. There is a need to clean
incoming data. This includes single enforced input
interface and management of downstream changes.
Data input validation implies enforcement of standards. Thus, most of the standardization issues apply here.
Data access and security. A standard is needed for
data extracts. There is a need to access legacy systems and a number of other databases to obtain
information. People don't know enough details about
these systems, and when they ask, they are faced with
constant problems.
Single primary source of data items. A single
repository for data items is needed. In particular, this
represents the need for reconciliation of metadata, by
having a single system act as a steward. A single
repository would provide the definitive source of
metadata information for all the corporate data in the
various systems.
Communications/administrative complexity. Soft-
ware releases and upgrades can impact other systems.
There is a need for a repository for proposed
changes that must be adequately documented.
Ownership and accountability. Reference data
should not reside in individual applications.
Owner(s) should be identi®ed and propagated as
needed from a single source. Data ownership is
needed for data elements.
Methodology. A methodology is needed for con-
tinuous quality improvement. Some of the items mentioned were: how to integrate data quality metrics and processes, the need for a forum to discuss data quality issues, the need to coordinate existing efforts, and the need to integrate data quality into applications and improve processes. During the implementation of processes
processes. During the implementation of process
among various systems, one client needs to do a
mapping between the databases and the data models.
They require support for technical and implementa-
tion issues and performance, and they need to quantify
the economics of the applications. A methodology is
needed for continuous quality improvement.
Recovery from data inconsistencies caused by software bugs. Software bugs cause data problems that
sometimes never get cleaned up. No procedures for
data cleanup (i.e., recovery) are included with
software patches.
3.2. Data quality metrics
After the problem identification, we decided to focus quickly on data quality metrics since this area was rated the highest in the workshops.
As shown in Fig. 1, and discussed previously,
software systems involve many components (e.g., the
data, the application software that operates on the
data, the underlying platforms on which the data
resides, and the overall process for using and
managing the system). Conceptually, a large number
of metrics can be kept for each component to measure
its behavior and to monitor its improvement.
Examples of the metrics, from a data quality point
of view, are:
* Metrics of the data itself (e.g., correctness, currency, consistency, completeness).
* Metrics of the application software that operates on the data (e.g., software bugs).
* Metrics of the platform (e.g., the performance of the access mechanisms).
* Metrics of the management and operational processes (e.g., RMAs, errors, delays, flowthroughs, user satisfaction level).
Candidate metrics for data quality

a. Customer Related Metrics
* Customer satisfaction (user view)

b. Data Metrics
* Data accuracy (e.g., wrong V&H coordinates, wrong pairs available)
* Data currency
* Data completeness
* Data consistency (same field in different systems)
* Which data items are the dirtiest?
* Data usage (which data items get used most often by whom)

c. Platform Metrics
* Storage occupied
* Data access: ease of use
* Data access: response time
* Platform reliability

d. Application Software Metrics
* Changes in data structures (e.g., field lengths)
* Number of changes made to software due to changes in the real world (e.g., changes in copper facilities)
* Application software bugs

e. Process Metrics
* Frequency of manual input (normal process, not requests for manual assistance (RMAs))
* Errors between systems (i.e., OK for a system but an error when passed along to another one)
* Money (business) lost due to wrong quotes
* Percentage of held orders that can be attributed to bad inventory data
* Field technician time wasted due to bad data (addresses, inventory)
* Accuracy of property records for tax purposes
* Accuracy of location data for tax/tariff purposes
* Number of telephone calls (RMAs) produced due to incomplete data supplied by field engineers
* Number of calls received from customers due to missing information in the forms
* Number of sources of information looked at before a customer question can be answered
* Number of times a question is generated due to bad data
* Number of times (percentage) the RMAs were not handled correctly (i.e., second order RMAs)
* Retries due to staff errors
* Successful flowthroughs and things done in less time than expected
* Fallouts in end-to-end flowthroughs for provisioning
* Critical dates met versus missed
* Labor/error resolution
* Overtime labor/retry error resolution
* Volume of service orders by circuit id
* Errors in dealing with service orders
* Backlogs
* Average time/error
* Volume of orders per center
* Volume of errors per center
* Volume of backlog per center
Based on the feedback from the Requirement
Solicitation Workshops, response to the survey
questionnaire, and general literature surveys, we
developed a set of key metrics for these components.
The resulting metrics are listed in the sidebar ``Candidate Metrics for Data Quality''. Naturally, not all of these 40 metrics are needed; you need to reduce this list based on metric priority, method of measurement, frequency of measurement, cost of measurement, risk of ignoring, and other appropriate factors. Table 2 shows how you can select data quality metrics. The main idea is to focus on those metrics that have the highest business impact. The best candidates for selection are those metrics that have high priority, low cost of measurement and high risk of ignoring. We illustrate the use of this table with a few metrics. We used a similar table to evaluate and select the most appropriate metrics out of the 40 we identified.
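As an illustration of the kind of screening Table 2 supports, the sketch below scores a few candidate metrics on priority, cost of measurement, and risk of ignoring, and keeps the high-impact, low-cost ones. The numeric weights and the threshold are our own illustrative assumptions, not a rule from the project.

```python
# Candidate metrics with (priority, cost of measurement, risk of ignoring), each L/M/H.
CANDIDATES = {
    "Customer dissatisfaction": ("H", "H", "H"),
    "Disk storage overruns":    ("L", "L", "L"),
    "RMAs":                     ("H", "M", "H"),
    "Retries":                  ("H", "M", "H"),
}

SCORE = {"L": 1, "M": 2, "H": 3}

def select(candidates):
    """Keep metrics whose business impact (priority + risk) outweighs measurement cost."""
    selected = []
    for name, (priority, cost, risk) in candidates.items():
        impact = SCORE[priority] + SCORE[risk]
        if impact - SCORE[cost] >= 2:      # illustrative threshold
            selected.append(name)
    return selected

print(select(CANDIDATES))   # matches Table 2: everything except disk storage overruns is kept
```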
After a metric process has been established and
implemented, a quality control process must be
instituted to monitor, evaluate and control the process
based on the metrics. The focus is on detection and
elimination of special causes that appear in the
process. In particular, the process performance is
predicted/measured against:
* Process goals (e.g., manual interventions must not exceed 100 per week).
* General trends (e.g., gradual improvement of data quality).
* Variation control (e.g., learn root causes for pronounced variations).
Numerous examples of quality control in manu-
facturing can be found. For example, many
manufacturing organizations target 100 defects per
million parts (Motorola has targeted 3 defects per
million). The main challenge is to monitor and manage
variations against targeted goals for improvement.
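To show what monitoring a process goal might look like, the following minimal sketch uses made-up weekly counts of manual interventions, checks each week against the stated goal, and flags pronounced variation with a simple two-sigma rule; it is an illustration of the idea rather than the project's actual control process.

```python
from statistics import mean, stdev

# Hypothetical weekly counts of manual interventions; the process goal is <= 100 per week.
weekly_interventions = [82, 91, 78, 104, 88, 95, 143, 86]
GOAL = 100

mu, sigma = mean(weekly_interventions), stdev(weekly_interventions)
upper_limit = mu + 2 * sigma   # simple two-sigma limit for detecting pronounced variation

for week, count in enumerate(weekly_interventions, start=1):
    notes = []
    if count > GOAL:
        notes.append("misses goal")
    if count > upper_limit:
        notes.append("pronounced variation: investigate root cause")
    print(f"week {week}: {count} manual interventions", "; ".join(notes))
```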
3.3. Intersystem data flow
Intersystem data flow in large scale telecom systems
was ranked as one of the most important problems in
the workshops. In a data flow process, data is
generally received by a system (either from one or
more other systems, or directly, as input data),
processed (and perhaps altered) by that system, and
then passed to one or more other systems. The
processing system may even create new data. Data may be committed to permanent storage (perhaps after transformation) within the processing system and/or may be sent to other systems.
Although discussed frequently in the literature (Umar, 1998), it is generally not the case that data in
multiple systems are accessed or modi®ed concur-
rently. In particular, read and/or write operations
under the same transaction umbrella that involve
multiple systems are seldom required in telecommu-
nications systems. However, sometimes a system may
need to request some reference data from one or more
external systems.
Typically, the same data is passed through and
processed by many systems, thus the data must be
viewed in a context larger than that of any individual
system. Such a view should include both representa-
tion (syntax) and interpretation (semantics) of
data.
Table 2. Quality metrics selection example

Quality metric (examples) | Metric priority (L, M, H) in terms of net cost | Method (tool) for measurement | Suggested frequency of measurement | Cost of measurement (L, M, H) | Risk of ignoring (L, M, H) | Selection decision (Yes/No)
Customer dissatisfaction | H | Customer surveys | Once a month (or at the end of each assignment) | H | H | Yes
Disk storage overruns | L | System utility | Daily | L | L | No
RMAs (requests for manual assistance) | H | PAWS report | Daily | M | H | Yes
Retries | H | PAWS report | Daily | M | H | Yes
Data flow is a complex problem, involving many issues that must be considered in order to design a flow that will be successful and that will minimize data reconciliation problems. It is not sufficient to consider data flow in just a local context (i.e., with respect to the systems that a given system either directly receives data from or directly sends data to). Rather, it is critical that data flow be considered in the context of the enterprise and its business goals and processes. In particular, it is important that such a global context be one in which unanticipated data flows can be accommodated, as systems evolve and new systems are created.
It is also important that we be able to accommodate
legacy systems in the data ¯ow, even though they
might not be architected in a way that would be
desirable if we were designing them ``from scratch''.
We developed a set of recommendations for
effective data ¯ow. A few of these recommendations
are listed in the sidebar ``Sample Data Flow
Recommendations''.
3.4. Assessment of commercial technologies for data quality
A diverse array of technologies, available from a
variety of vendors, can be used to address many of the
data reconciliation problems identi®ed during the
Workshops. These technologies come with varying
price/performance ratios, offer a wide range of
features, and are available on different computing
platforms. Examples of the technologies of particular
interest to us are data scrubbers, data warehouse tools, replication servers, workflow systems and multidatabase systems.
Table 3 summarizes the results of analyzing the
potential impact of the various COTS technologies to
address the issues that were raised during the
requirements workshops. The potential impact is
quantified as high, medium and low to indicate the
possible effectiveness of COTS technologies to
address the issues raised during the workshops. As
can be seen, there is no single solution to the entire problem; however, a well balanced combination of the aforementioned technologies can alleviate the problem considerably.
Sample data flow recommendations

Major Global Organization Recommendations
* Distinct systems should have distinct purposes, should not cause the same kinds of changes to the same data items, and should not cause the same kinds of physical phenomena to occur to the same real world objects.
* A system that requests modifications to particular data should have a high proximity to the system that stewards that data.
* A system that has a need for highly accurate data should have high proximity to the system that stewards that data.
Table 3. Relative impact of solution technologies (L, M, H)
Issues Data warehouses Replication servers Workflow systems Data scrubbers Multidatabase systems
*Location problems M H M
*Inconsistency among systems M H H L L
*Process improvement and mechanization L M H
*Metrics
*Flow through between systems L L H
*Root causes and data redundancy H M
*System architecture and system evolution M M M
Standardization M H L
Inconsistency with reality M L
Data input validation L L
Data access and security L H
Single primary source of data H
Communications/administrative complexity
Ownership and accountability
Methodology H
Recovery from data inconsistencies M H
* Data should flow from a system that requires higher precision to one that requires lower precision.
* Data should flow from a more constrained system to a less constrained system for that data.
* Data should originate with the system that is the steward of the data, and should flow to systems that need to use the data.
* A tree structure is the preferred organization for data flow (no loops).
* Race conditions should be designed out of a data flow process. This includes scenarios where the same data is sent to a system from multiple sources, as well as the scenario where different, but dependent data are sent to a system from multiple sources.
* A mechanism (such as sequence numbers) should be provided with data transmissions, and/or a uniform format should be provided so that systems can determine an appropriate order in which to process the data.
Major Metadata Recommendations
* All systems that contain the same data items should follow identical standards for the characteristics of those data items, as well as for integrity constraints on those data items.
* The same data should be converted by multiple systems in the same manner.
* Consistency of integrity constraints, business rules, and data models needs to be considered in a global context.
Major Data Validation Recommendations
* Any particular data item should be manually entered into a single system.
* Data should be entered into the system that is the steward of that data.
* If data is entered into a target system that is not the steward, that target system should essentially act as an interface system to the steward, and forward the data to the steward for entry into the stewarding system.
* Data should be validated within each system that the data resides in, as part of a flowthrough process.
* Data may need to be validated, relative to a business rule, as a set, and not just as individual facts.
* Data needs to be either rejected or marked as being invalid as soon as it is detected.
* A feedback loop to the steward would help correct data that a downstream system considers invalid. The steward should also be sent an indication of why the data is invalid.
* Data that is determined to be invalid should not be passed on to other systems downstream.
Major Data Flow Enablement Recommendations
* Data that is highly dynamic should not be updated by multiple systems.
* Data that needs to be validated together should be sent as a unit.
Major Intersystem Access Recommendations
* Distributed Transaction Processing is not currently required as an access paradigm for data flow, but it may be appropriate to consider it at some point in the future.
* Data should never be simply sent to another system under the assumption that it will be received; some form of guaranteed delivery or acknowledgment should be used.
* Appropriate synchronization mechanisms should exist to ensure either that data is received in a correct order for processing or that the receiving system can determine the correct order to process incoming data (a small sketch after these recommendation lists illustrates this idea).
Major Data Replication Recommendations
* Data replication should be minimized, and should only be done for very good reasons.
* If data is replicated in more than one place, then the most accurate source of the data is the system that stewards the primary enterprise copy of that data.
* Data that is highly dynamic is a poor candidate for replication.
* Data replication may involve more than simply producing a new version of a database.
Major Procedures and Design Process Recommendations
* Automation of data flow should be implemented where it is economically feasible to do so.
* Evolution should never occur without the knowledge and participation of representatives from other systems that could potentially be affected.
* There needs to be a carefully managed schedule for
phasing in new systems or new versions of systems.
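The intersystem access and flow enablement recommendations (guaranteed delivery with acknowledgment, sequence numbers, processing in a correct order) can be illustrated with a small sketch of a receiving system that buffers out-of-order updates and applies them strictly in sequence; the class and message format are invented for illustration.

```python
class OrderedReceiver:
    """Buffers incoming updates and applies them strictly in sequence-number order."""

    def __init__(self):
        self.next_expected = 1
        self.pending = {}     # sequence number -> payload, held until its turn
        self.applied = []

    def receive(self, seq, payload):
        self.pending[seq] = payload
        # Apply every consecutive message now available, acknowledging each one.
        while self.next_expected in self.pending:
            self.applied.append(self.pending.pop(self.next_expected))
            self.acknowledge(self.next_expected)
            self.next_expected += 1

    def acknowledge(self, seq):
        # Stand-in for the guaranteed-delivery acknowledgment sent back to the sender.
        print(f"ack {seq}")

rx = OrderedReceiver()
for seq, update in [(2, "change street type"), (1, "create location"), (3, "set ZIP code")]:
    rx.receive(seq, update)
print(rx.applied)   # ['create location', 'change street type', 'set ZIP code']
```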
Data Warehouse (DW) tools can be utilized in a variety of ways within the framework of data quality. Although DWs are voluminous and cannot replace day-to-day operational systems due to performance overhead, they can be used as a point of reference for other systems. DWs help with the ``inconsistency among systems'' problem. DWs can also provide some standardization that other systems can refer to, since they usually contain information from several other operational systems. In addition, DWs can help to a limited extent in process improvement and mechanization because they contain a lot of informational context.
Replication servers can alleviate the inconsistency
among systems in cases where there is replicated
information stored in various systems. They can help
in process improvement due to their automatic way of
propagating information. They are also designed to
support a scheme with a primary source of data and
secondary replicas. Flowthroughs are improved in
the presence of a replication server, and they
provide an excellent tool to maintain consistent
copies of replicated data. Some replication servers support conditional replication, which is extremely useful in a telecommunication company. Specifically, one can decide how often the replicated data will be updated and under what specific conditions.
Workflow systems mainly help in designing flows among various systems. They contribute to Process Improvement and Automation because they provide tools to change flows and re-direct them among various systems easily, with no significant programming effort. Updating flowthroughs can become more effective with the presence of a workflow system. Workflow systems can support system evolution specifically due to their flexibility of flow re-design.
Data scrubbers provide useful tools for data entry validation and intrasystem cleanup, one system at a time. This validation capability provides a means of standardizing interfaces and helps minimize inconsistency among systems.
Multidatabase technology can be used in a variety of facets of data reconciliation. In particular, it is useful as ``middleware'' that provides connectivity between systems. It can help with access to, and authorization for, systems.
Although the extant COTS technologies exhibit a great deal of potential to address many of the issues raised during the workshops, the full benefits may be somewhat limited due to the following limitations:

1. In general, data reconciliation requires a mixture of COTS. Many COTS do not work with each other (e.g., workflow systems do not work with replication servers). Integration of COTS is an area of future work.
2. Many existing tools do not work on legacy data sources such as IMS, VSAM, etc. However, many RBOC data reconciliation problems require legacy data reconciliation.
3. Workflows with added capabilities for distributed transaction processing are needed to deal with the issues of long (Saga) transactions.
4. Web interfaces to existing tools are just beginning to emerge (e.g., Web access to workflows, Web access to data warehouses). This area needs to mature somewhat to be effective.
5. A potential area of future work is development of an intelligent, Web-based interface that invokes different technologies (e.g., data scrubbers, replication servers, workflows, multidatabase middleware, data warehouse tools) based on a problem type.
3.5. Implementation: the pilot project
To demonstrate how some of the technologies can
work together to solve large scale data quality
problems, we created a pilot project. Its purpose was
to design and implement a prototype to improve the
quality of data being used in operational systems of
telecommunication applications. The pilot project
revolved around two existing Telcordia operational systems: CLONES and LOC.
CLONES is a system responsible for creating
unique codes to name and identify various entities
such as equipment, building locations, etc. The entire
system is built on top of a relational database
(INFORMIX in Unix environments) which manages
these unique codes on behalf of Telcordia clients.
LOC is another Bellcore system comprising applications built on top of a relational database (DB2
on MVS) that stores and maintains location informa-
tion. In particular, LOC contains geographical
location information about planning and forecasting
areas, telecommunication equipment inventory, vendors, customers, etc. LOC also stores relationships
among these locations.
When a new entity is given a location code in
CLONES, it must also be inserted into LOC, so there is
an information flow from CLONES to LOC regarding locations. However, CLONES is a separate application that does not have detailed location information about a new entity. This information is added in LOC but often does not exist in CLONES. Since CLONES and LOC are independently updated, the databases can become inconsistent. The consequence of this type of low data quality becomes evident when some application gets invalid information from either CLONES or LOC and continues based on erroneous or invalid information. In this section we outline the replication and data scrubbing technologies we used to improve the data quality in this environment.
Setting up the environment. Since we did not want
to interfere with the daily operations of the actual
CLONES and LOC while they were in operation, we decided to get copies of these databases and set up our pilot system on these copies. This allowed us more flexibility with experiments such as schema changes,
workarounds, etc. on copies of the actual data stored in
our local PC databases. Since CLONES is imple-
mented on an INFORMIX database running on UNIX,
and LOC is on a DB2 running on MVS, we had to get
the CLONES and LOC schemas as well as data
extracts out of the original databases. We also chose to
use a commercial replicator between CLONES and
LOC. For the purposes of the pilot project, Microsoft SQL Server 6.5 proved to be adequate to carry out the
experiment since it comes with a data replicator that
can send data to another MS SQL Server, or to another
database that has 32 bit ODBC drivers compatible
with those of MS SQL Server (for example ORACLE).
Therefore we created extracts from CLONES and LOC
which we inserted into two separate SQL Server
databases running on different PCs.
* CLONES: Data for several tables were extracted as ASCII delimited files and were sent to our environment via FTP. Then they were entered into the SQL Server in a schema that was identical to the INFORMIX one, using a ``bulk copy'' utility of SQL Server. However, dates were creating errors and we had to enter the date columns as strings. Translation back to dates could be done later on.
* LOC: Data was extracted from the LOC database, which is a DB2 database on MVS. However, several tables in LOC contained columns that represented data that was not present in CLONES.
Fig. 3 depicts the overall architecture of the pilot
project. We addressed data quality in three major
categories: Input validation, data scrubbing and data
replication. We added an input validator on top of
CLONES and LOC in order to clean new data entries.
We also addressed ``dirty'' data already existing in the
databases through a data scrubber. Once the data in
CLONES had better quality it was replicated into LOC
automatically. However, LOC adds more columns to the data and has more detailed information about locations. To avoid overwriting such additional information through automatic replication from CLONES, we also sent such data back to CLONES to update its records with more recent and correct information. In the following we describe in more detail each tool that we used to improve the data quality of those systems.
Input validation, standardization and scrubbing.
One of the major sources of ``dirty'' data in the databases is human error, especially when data is being entered into the system. Our philosophy was to ``trap'' such occurrences as early as possible. For this reason, we decided to filter and standardize name and address information after it is entered by the user, and before it is stored in the database, thus providing input validation. Therefore, we prototyped one of the CLONES data entry screens: the screen used to insert a
building code into CLONES. The user enters the
building name and address in free format. The name
and address are then passed to a software scrubbing tool
for decomposition into their constituent components,
and standardization into a predetermined format.
The use of standardization tools ensures that address information in the database is entered in a predefined way and minimizes human input errors. For example, ``Main St.'' and ``Main Street'' represent the same entity; however, from the database viewpoint they could create two separate rows. In cases where keys were automatically created (such as surrogate keys), the situation could easily lead to
unwanted cases of having two distinct keys repre-
senting the same entity.
We used a commercial tool for data standardization from Innovative Systems, Inc. The ISI Edit tool standardizes and decomposes names and addresses into components that in turn are used in SQL statements to query or update the database. The ISI Edit routine classifies an input line into one of 12 line types. For our prototype, the line types of interest were:
``O'': Organizational (used for building name, e.g., ``Premier Widget Corp'').
``S'': Street (the address line containing a street name, e.g., ``445 South Street'').
Fig. 3. Final pilot project architecture.
``C'': City (the address line containing the city, e.g., ``Morristown, New Jersey 07960'').
For each input line, ISI Edit returns a standardized line. ISI Edit standardizes organizational suffixes (which may appear in the building name), street types, directional prefixes, directional suffixes, subaddress identifiers, rural identifiers, box identifiers, building identifiers, states and provinces.
For certain line types (including ``S'' and ``C''), the system also returns fixed-field data elements and codes. For example, the street line (i.e., type ``S'') is decomposed into street address number, street name and street type, street prefix, and street suffix. Since the street name and street type are separate fields in the CLONES table, we had to split the output returned by ISI Edit ourselves. Table 4 shows a few sample records that were standardized using ISI Edit.
Although our attempt to provide input validation was successful, we still had to clean the data already residing in the database. We utilized the standardization capabilities of the ISI Edit tool and created an application that scans the entire table in the database and standardizes each record. If the original and the standardized record were not identical, then we needed to take action. At this point there were two avenues to take: either proceed in batch mode and replace the
original record with the standardized one, or present
the result to a user who chooses whether to update the
record or not. We chose the latter approach to visually
examine the differences between the original and the
standardized record.
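To make the scrub-and-review loop concrete, here is a minimal sketch that uses a toy standardization function (a stand-in for the commercial ISI Edit tool, not its real interface) and a hypothetical list of address rows; it scans every record, standardizes it, and queues the records that changed for a reviewer's decision.

```python
# Toy stand-in for a commercial standardizer such as ISI Edit (not its real API).
ABBREVIATIONS = {"STREET": "ST", "ROAD": "RD", "LANE": "LN", "NORTH": "N", "CORPORATION": "CORP"}

def standardize(line):
    """Upper-case the line and apply a fixed abbreviation table to each word."""
    words = line.upper().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# Hypothetical address rows already in the database.
address_table = [
    "1033 North Tower Hill Road",
    "445 South Street",
    "1033 WINDING LN N",          # already standardized; no action needed
]

review_queue = []
for original in address_table:
    cleaned = standardize(original)
    if cleaned != original:
        # Present the pair to a user who decides whether to update the record.
        review_queue.append((original, cleaned))

for original, cleaned in review_queue:
    print(f"{original!r} -> {cleaned!r}")
```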
Schema manipulations for quality metrics. To be
able to assess the quality of the data in a table, we
performed some simple additions to the schema, to
ease the creation of metrics for quality:
* Addition of an expiration date column: Some addresses in CLONES may have a temporary nature. For example, when a new subdivision is being built and no street names have been assigned yet, the data that is entered into CLONES is temporary (e.g., BLOCK 5, LOT 3) since no better description of the location exists when a location code is created. In such cases, the person entering data into CLONES can specify an expiration date for this particular CLLI code. Later on, when a real street name and number have been given to that location, it is very easy to query for expired addresses and to correct them. In addition, one can find out very easily how many addresses have expired, thus getting some metrics on the quality of the data.
* Addition of a quality column: When new location information is entered into CLONES, the person entering it knows how precise the information is compared with other addresses already in the database. A new column was therefore added to the schema of the address table in CLONES which identifies the quality of the information entered. Then it is easy to assess the data quality of the table by issuing a straightforward SQL statement, as sketched below.
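A minimal sketch of the kind of straightforward SQL this enables is shown below; it uses SQLite and invented table and column names (expiration_date, quality) rather than the actual CLONES schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        clli_code       TEXT PRIMARY KEY,
        street          TEXT,
        expiration_date TEXT,   -- NULL when the address is not temporary
        quality         TEXT    -- e.g., 'precise', 'approximate', 'temporary'
    );
    INSERT INTO address VALUES
        ('MRTNNJ01', '445 SOUTH ST', NULL, 'precise'),
        ('SPRTNJ02', 'BLOCK 5, LOT 3', '1999-01-01', 'temporary'),
        ('MORRNY03', '1033 N TOWER HILL RD', NULL, 'approximate');
""")

# Expired (and therefore suspect) addresses that should be revisited.
expired = conn.execute(
    "SELECT clli_code, street FROM address "
    "WHERE expiration_date IS NOT NULL AND expiration_date < '1999-06-01'").fetchall()

# Simple quality profile of the whole table.
profile = conn.execute(
    "SELECT quality, COUNT(*) FROM address GROUP BY quality").fetchall()

print("expired:", expired)
print("profile:", profile)
```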
Data replication. We used Microsoft SQL Server
6.5 Replicator to experiment with data that are entered
in CLONES and then are sent to LOC. In our prototype
implementation we concentrated on data that populate
the SITE table in LOC. This table contains informa-
tion originating from four tables in CLONES.
In general, the convention for replicating from one
database to another is via a one-to-one mapping. SQL
Server provides an easy way to replicate on a one-to-
many scenario where the primary table is replicated to
many secondary sites. However, in our particular
example we faced a completely different situation:
Several tables in CLONES are mapped into a single
table in LOC. That is, we have a many-to-one scenario
to deal with. MS SQL Server provides many options regarding what to replicate (selected columns, rows, or a combination of both) and how often to replicate (instantly, periodically, every x number of transactions on the primary copy, etc.). However, it is
not straightforward to combine a set of tables for
replication into a single one. One way would be to
create a view on CLONES and replicate the view;
however, only base tables can be replicated, not
views. The approach we took was to replicate each
table in CLONES to LOC and reconstruct the SITE
Table 4. Illustration of data cleanup
Input data Output data
Fleet National Association FLEET NATIONAL ASSOC
1033 North Tower Hill Road 1033 N TOWER HILL RD
Morris, New York 10011 MORRIS, NY 10011
Hasbro Corporation HASBRO CORP
1033 Winding Lane North 1033 WINDING LN N
Sparta, NJ 07999 SPARTA, NJ 07999
table in LOC. This approach worked fine, and replication proved to work in a very satisfactory way. During the initial synchronization phase, all data (about 30 MB) were sent to LOC without any problems. Subsequent updates on CLONES are immediately propagated into LOC with no time delay. In short, the replication technology proved to be mature enough for our purposes.
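The many-to-one mapping described above, replicating each source table and then rebuilding the target table on the receiving side, can be sketched as follows. The table and column names are invented for illustration and do not reflect the real CLONES or LOC schemas.

```python
import sqlite3

loc = sqlite3.connect(":memory:")   # stands in for the LOC-side database
loc.executescript("""
    -- Replicated copies of (simplified, invented) CLONES tables.
    CREATE TABLE clones_location (clli_code TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE clones_address  (clli_code TEXT PRIMARY KEY, street TEXT);
    INSERT INTO clones_location VALUES ('MRTNNJ01', 'MORRISTOWN');
    INSERT INTO clones_address  VALUES ('MRTNNJ01', '445 SOUTH ST');

    -- Target table whose rows combine several source tables.
    CREATE TABLE site (clli_code TEXT PRIMARY KEY, city TEXT, street TEXT);
""")

def rebuild_site(conn):
    """Reconstruct SITE from the replicated source tables (a view cannot be replicated)."""
    conn.execute("DELETE FROM site")
    conn.execute("""
        INSERT INTO site (clli_code, city, street)
        SELECT l.clli_code, l.city, a.street
        FROM clones_location AS l JOIN clones_address AS a USING (clli_code)
    """)
    conn.commit()

rebuild_site(loc)
print(loc.execute("SELECT * FROM site").fetchall())
```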
The following are the main findings from the pilot project effort:

* Front-end data validation using ISI Edit worked well.
* Data was successfully entered into the ``CLONES'' database.
* Data was successfully replicated to the ``LOC'' database via MS SQL Server Replicator.
4. Lessons Learned and Suggested Approach
Our interactions with a large population of customers
and working with numerous tools and methodologies
have taught us the following lessons:
* Data quality should not be discussed in a vacuum. Many other ``non-data'' issues such as platforms, processes, and software architectures must be taken into account while considering data quality. This observation has been articulated in the literature (Orr, 1998).
* In discussions with end users and customers about data quality, the data quality issues get intermixed with other issues such as software quality and quality of service.
* Despite increased attention, enterprise data quality is not a well established and recognized corporate activity with concomitant corporate responsibility. Without such responsibility, it is difficult to improve the quality of enterprise data.
* In terms of priority, we have found that more problems were related to process than to data. For example, many high priority problems were not related to the quality of the data itself but were instead concerned with process and platform issues.
* Different organizations naturally have different data that is of primary concern for data quality. In the telecom industry, it happens to be location data that shows the location of customers, equipment, and work centers.
* Our observations contradict Ken Orr's assertion ``Data quality: use it (the data) or lose it'' (Orr, 1998). We have found that critical data is used frequently but is less apt to be of higher quality because people tend to create more copies of critical data and thus create data quality problems.
* For continuous improvement of data quality, it is essential that data quality procedures be meshed with existing corporate practices (i.e., data quality should not be a separate activity that may or may not be conducted). We suggest that integration of data quality with enterprise data architectures is an appropriate approach.
* Issues are not all technical. We found several issues that are purely administrative. In particular, an owner of data quality must be appointed to handle the administrative issues.
* Different views on data quality exist and expectations vary widely among users. In addition, many hidden costs (e.g., personnel training) are encountered and should be taken into account.
* The COTS technologies address different parts of the problem. However, complete solutions require integration of several technologies, a non-trivial task.
Our approach, based on these observations and
lessons, suggests a comprehensive data quality
methodology. In addition, we have synthesized the
main ideas from the extant literature to augment our
thinking. For example, Wang (1998) has developed
the concepts, principles, and procedures for defining, measuring, analyzing and improving information products. Based on these and accumulated research efforts, Wang presents a Total Data Quality Management (TDQM) methodology, and illustrates how this methodology can be applied in practice. TDQM consists of four steps: define, measure, analyze, and improve. However, it is not clear how
this methodology can be integrated with the day-to-
day corporate activities of, say, data architectures. On
the other hand, Orr (1998) emphasizes that the
primary focus of data quality projects is to increase
the internal controls involved in entering and editing
data. These efforts, according to Orr, are doomed to
fail, as are one-shot attempts to clean up data, unless the use of that data is increased. Orr contends ``data quality: use it or lose it'', i.e., if an organization wants to improve data quality, it needs to ensure that there is stringent use of each data element, because use-based data quality provides a theoretically sound and practically achievable means to improve data quality. Although we have found, in our practical experience, that highly used data tends to become inconsistent more frequently (people create multiple copies), we do agree that data quality should not be pursued in a vacuum and should be integrated with other activities to increase use.
The basic premise of our approach is that data
quality should be integrated with data architecture
activities and data architecture should be planned with
quality in mind. In particular, all activities in data
architecture (data requirements, data modelling,
partitioning, allocation, data flows and access/interoperability) must keep data quality considerations at
high priority. Our approach, depicted in Fig. 4, is
presented as a sequence of steps that can be used as a
checklist for a comprehensive data quality strategy.
This approach is designed to address the large scale
data quality problem discussed in Section 3. The
following discussion briefly reviews these steps with
particular attention to data quality.
Step 1: Information Requirements and Modeling for Data Quality
Data quality must start with business requirements that drive the technology requirements. Examples of business requirements for the prevention and amelioration of poor data quality include reduction of the cost of doing business and minimization of bad exposure.
The information requirements specify the data, the
natural relationships between data, the transactions to
access and manipulate the data, and software platform
for database implementation. During this step, it is
important to develop an understanding of the
following requirements that drive data quality before
proceeding:
* Data content requirements (i.e., accuracy, consistency, currency).
* Size information (number of users who will access the database, database size).
* Response time (average, worst case) requirements.
* Scaling and growth requirements.
* Data security requirements.
* Data availability restrictions.
* Data synchronization requirements (i.e., how frequently duplicate data should be synchronized: immediately, hourly, daily, etc.).
* Connectivity needed from user sites (e.g., from PCs, Macs, UNIX).
* Interoperability and interfaces with other (existing/proposed) systems.
* Portability between computing platforms (data portability, processing portability, user interface portability).
* Backup/recovery needs and disaster handling.
* Policy restrictions (e.g., standards and policies to be followed, application control and auditing restrictions).
After information requirements, a data model is
developed to represent the view of data at several
levels of detail. From a data quality point of view, data
must be modeled, named and defined with a singular definition and uniform business rules to support all customers. In essence, the following steps in the development of a data model should keep data quality in mind (a toy sketch of the resulting artifacts follows the list):
1. Define entities (the principal data objects about
which information is to be collected).
2. Identify relationships (the real-world associations)
between entities. Relationships between entities
can be one to one, one to many, and many to many.
3. Identify and develop attributes (characteristics of
entities).
4. Represent this information by using an ERA
diagram.
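As a minimal sketch of how these modeling steps can be captured in a machine-checkable form, the following Python fragment is our own illustration (the Customer entity, its attribute, and the relationship are hypothetical and not taken from the case study); it records each attribute's singular definition together with a uniform business rule:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Attribute:
    name: str
    definition: str                 # the singular, enterprise-wide definition
    rule: Callable[[object], bool]  # uniform business rule (validity check)

@dataclass
class Entity:
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)

@dataclass
class Relationship:                 # real-world association between entities
    name: str
    from_entity: str
    to_entity: str
    cardinality: str                # "1:1", "1:N", or "M:N"

# Hypothetical example: one entity, one attribute, one 1:N relationship.
customer = Entity("Customer", {
    "service_address": Attribute(
        name="service_address",
        definition="Postal address at which service is delivered",
        rule=lambda v: isinstance(v, str) and len(v.strip()) > 0,
    )
})
has_orders = Relationship("has_orders", "Customer", "ServiceOrder", "1:N")
```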
Step 2: Planning for Data Quality
Effective planning and organizing is the key to
improving data quality. This step involves the
following activities:
* Choose the most appropriate metrics and measurement instruments. A few
key metrics need to be selected for tracking because not all metrics can be
or should be tracked. We, for example, assembled a list of 40 metrics that
could be used for measuring data quality. You can customize this list based
on metric priority, method of measurement, frequency of measurement, cost of
measurement, risk of ignoring, and other appropriate factors. The key issue
of the costs and benefits of gathering metrics should be considered.
Well-known sampling techniques can be used so that a small sample suffices
to determine the key problems. Selection of proper instrumentation is an
important aspect of metrics (a small measurement sketch appears at the end
of this step).
* Establish a solid replication strategy. Decide what data will be
replicated, why, and how. This means deciding what data will be moving
between sites and whether it moves in one direction or bi-directionally.
Most organizations have some choices in what pieces of information should
be placed where (i.e., how much information will be replicated in your
home-grown database and how frequently it will be synchronized).
* Determine a synchronization interval for different classes of data. A
synchronization interval is the amount of time you can afford (from a
business point of view) for the data to be out of synch. Synchronization
intervals can be tight (i.e., transaction level) or relaxed (i.e., daily,
weekly). Establishing synchronization intervals can impact data quality.

Fig. 4. Methodology to build quality in data architectures.

* Adopt a cleanup strategy. This includes the initial cleanup and a
``cleanup interval'' for periodic scrubbing of data. For example, establish
a strategy to make the location addresses compliant with postal addresses
and decide how frequently to repeat the cleanup process.
* Architect interfaces to other applications and flowthroughs. How will the
data be exchanged between different applications (open interfaces), and how
will it flow smoothly? In particular, how will the information flow between
different databases and the back-end systems?
* Establish end-user interfaces/access strategies. How will the users access
the data (try to close the back doors)? This issue was brought up in some of
our workshops.
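The measurement sketch referred to above is given below; it is our own Python illustration of the sampling idea, and the record layout, field name, validity rule, and sample size are assumptions rather than the project's actual metrics or data:

```python
import random

def estimate_field_validity(records, field, is_valid, sample_size=200, seed=7):
    """Estimate the fraction of records whose `field` passes a validity rule,
    using a small random sample instead of a full scan."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    if not sample:
        return 0.0
    valid = sum(1 for record in sample if is_valid(record.get(field)))
    return valid / len(sample)

# Hypothetical usage: how many location records carry a non-empty ZIP code?
locations = [{"zip": "07960"}, {"zip": ""}, {"zip": "07039"}, {"zip": None}]
score = estimate_field_validity(locations, "zip", lambda z: bool(z and z.strip()))
print(f"estimated ZIP completeness: {score:.0%}")
```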
Step 3: Data Allocation and Inter-System Flow Issues
Data allocation can cause duplication and thus result
in data quality problems (i.e., different copies
showing different and conflicting information). The
following general guidelines should be used for data
quality:
* Keep data replication as minimal as possible. A large number of replicas
can cause serious problems in two-phase commit (2PC) as well as in
Replication Servers.
* If data must be synchronized as a transaction, then keep the number of
copies small (experience shows no more than 3) and use 2PC.
* If concurrency requirements outweigh ``subsecond'' data integrity
requirements (i.e., data can be synchronized periodically), then use
Replication Servers.
* If the network and nodes are unreliable, then use Replication Servers.
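These guidelines can be read as a small decision procedure. The sketch below is our own Python condensation of them (the parameter names are ours; the three-copy limit comes from the experience cited above):

```python
def choose_replication_strategy(copies, transactional_sync_needed,
                                network_reliable):
    """Rough decision aid reflecting the allocation guidelines above."""
    if copies <= 1:
        return "no replication needed"
    if not network_reliable:
        return "Replication Server (periodic synchronization)"
    if transactional_sync_needed:
        if copies <= 3:
            return "two-phase commit (2PC)"
        return "reduce the number of copies, then use 2PC"
    # Concurrency outweighs subsecond integrity: synchronize periodically.
    return "Replication Server (periodic synchronization)"

print(choose_replication_strategy(copies=2, transactional_sync_needed=True,
                                  network_reliable=True))
```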
In addition to these guidelines, you should system-
atically study and evaluate various data allocation/
duplication strategies and eliminate unacceptable
options as quickly as possible. Discussion of data
allocation strategies is far beyond the scope of this
paper. Interested readers should refer to Umar (1997)
for details.
Intersystem data flow in large-scale distributed
systems is of vital importance in establishing a viable
enterprise data architecture. This issue also impacts
data quality because flowthroughs can cause delays
and data inconsistencies. Data flow, as discussed
in Section 3.3, is a complex problem, involving
many issues that must be considered in order to
design a flow that will be successful and that
will minimize data reconciliation problems. Briefly,
data flow must be considered in the context of the
enterprise and its business goals and processes. In
particular, it is important that such a global context
be one in which unanticipated data flows can be
accommodated, as systems evolve, new systems are
created, and legacy systems need to be interfaced with.
We characterize the issues for data flow design into
several categories. The categories include:

* Global Organization of Systems and Data (i.e., how should systems be
logically organized regarding data flow, how should data be logically
organized regarding data flow, how should timing constraints and race
conditions be handled?).
* Metadata (i.e., what standards should there be for data, what does a
system need to know about other systems, where is this information stored
and maintained, what is required of metadata in a global schema?).
* Data Validation (i.e., where should data be entered into a set of systems,
where should data be validated in a set of systems, what should happen if
data fails a validation test?).
* Data Flow Enablement (i.e., what must happen between systems to realize
the data flow, what is the role of immediate vs. batch data propagation, how
should dynamic data be treated in relation to reference data?).
* Access Paradigms (i.e., what should the intersystem access paradigms be,
how can we know that the data flow is successful, how do we deal with system
or network failures?).
* Data Replication (i.e., under what circumstances should we replicate data,
how should we manage replicated data?).
* Procedures and Design Process (i.e., what should be automated, what
procedures are required for job scheduling, how can we gracefully evolve a
set of systems, what is required of data flow documentation, what is the
role of people and organizations in data stewardship?).
For each category, a set of guidelines should be
developed. We discussed a few of these requirements
in Section 3.3.
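As one concrete reading of the Data Validation category, the sketch below (our own Python illustration; the field names and rules are hypothetical) validates a record at its point of entry and reports the failures so that they can be returned to the originating system instead of flowing downstream:

```python
def validate_at_entry(record, rules):
    """Apply entry-point validation rules; `rules` maps a field name to a
    predicate. Returns (ok, list of failure descriptions)."""
    failures = [f"{field}: failed {check.__name__}"
                for field, check in rules.items()
                if not check(record.get(field))]
    return (not failures, failures)

def non_empty(value):
    return bool(value and str(value).strip())

# Hypothetical usage: reject the record at entry rather than propagating it.
ok, problems = validate_at_entry({"clli_code": "", "state": "NJ"},
                                 {"clli_code": non_empty, "state": non_empty})
if not ok:
    print("Return to originating system:", problems)
```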
Step 4: Technologies Assessment and Selection
Selection of commercial off-the-shelf (COTS) technologies
should be viewed in terms of data quality. In
particular, software infrastructure should be chosen
for:
* Quick and efficient access of data from a wide array of users residing at
different sites.
* Interoperability of data between heterogeneous systems.
* Data cleanup.

Access of data from a wide array of users residing at
different sites is currently being provided through
Web technologies. Web browsers access these
databases through a Web server that may invoke
SQL middleware. ODBC/JDBC drivers are commercially
available to support ubiquitous access to
heterogeneous data sources from Web browsers.
Web-based tools with ODBC/JDBC should be
considered for data access quality.
Data interoperability is greatly influenced by the
exchange protocol (i.e., the format of the data and the
rules of exchange) that is used between clients and
servers. Let us ask the following question: can a
database stored in a vendor X database exchange
information with vendor Y tools? In general, the
answer to this question is no because the exchange
protocols are largely proprietary at the time of this
writing (i.e., clients from vendor X can only access
vendor X database). This mismatch has led to
``database gateways'' that convert vendor X protocols
into vendor Y protocols. DRDA (Distributed Relational
Database Architecture) Gateways are an example of
such gateways (DRDA Gateways from Informix,
Oracle and other vendors convert the Informix and
Oracle protocols to DB2 protocols). ODBC/JDBC
drivers can also be used for data interoperability.
Selection of an appropriate exchange protocol is also part
of the data quality work.
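The flavor of such driver-based access is sketched below in Python using the pyodbc package; the DSN names and the table are hypothetical, and the fragment is our illustration rather than code from the project. Because the ODBC driver hides each vendor's exchange protocol, the same function can read from databases supplied by different vendors:

```python
import pyodbc  # ODBC binding; assumes the DSNs and drivers are configured

def fetch_rows(dsn, query):
    """Run the same query against any source reachable through ODBC,
    regardless of the vendor behind the data source name (DSN)."""
    connection = pyodbc.connect(f"DSN={dsn}")
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        connection.close()

# Hypothetical DSNs for two different vendors' databases.
oracle_rows = fetch_rows("ORACLE_LOCATIONS", "SELECT clli, address FROM location")
db2_rows = fetch_rows("DB2_LOCATIONS", "SELECT clli, address FROM location")
```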
A diverse array of technologies, available from a
variety of vendors, can be used to address many of the
data quality problems. These technologies come with
varying price/performance ratios, offer a wide range
of features, and are available on different platforms.
Examples of the technologies of particular interest to
us are data scrubbers, data warehouse tools, replica-
tion servers, workflow managers, and multidatabase
systems. We discussed these technologies briefly in
Section 3.4. As noted previously, there is no single
solution to the entire problem; however, a well-
balanced combination of the aforementioned technol-
ogies can alleviate the problem substantially.
Step 5: Implementation, Deployment, and Support
An effective measurement and control process is the
key to improving data quality during the implementa-
tion, deployment and support stage. This involves:
* Keep focus on the root causes. The root causes may include enterprise
architecture issues such as the design of data flow/synchronization, the
architecture of legacy products, the number of manual interventions needed,
and the availability of well-trained staff.
* Select and deploy the most appropriate data quality improvement tools.
Data quality improvement tools fall into the following categories:
- Data scrubbers that correct incomplete and incorrect data through rules;
- Compare and stare tools that help in making the data consistent with reality;
- Data replicators to keep the data consistent.
We have discussed these tools in an earlier deliverable.
* Consider network issues. What type of loads and availability requirements
will be placed on the network (several replication schemes require highly
available networks)?
* Re-evaluate platform issues. On what platforms should replicated data
reside, and what type of middleware is available on these platforms (ties
into COTS)? We have reviewed platform issues in the previous step.
* Establish Quality Control. After a metric process has been established and
implemented, a quality control process must be instituted to monitor,
evaluate, and control the process based on the metrics. The focus is on
detection and elimination of special causes that appear in the process. In
particular, the process performance is predicted/measured against:
- Process goals (e.g., RMAs must not exceed 100 per week);
- General trends (e.g., gradual improvement of data quality);
- Variation control (e.g., learn the root causes of pronounced variations).
Statistical quality control (SQC) is used to detect and eliminate special
causes. For example, SQC can be used to monitor key metrics such as RMAs. If
the RMAs for a particular system are fluctuating significantly, then it is
important to understand the reason and determine the root causes of this
variation (a process owner should take responsibility for SQC; a small
control-chart sketch follows this list).
* Improve Data Quality. The basic principle is to focus on critical success
factors (key metrics) and systematically improve them by identifying root
causes and eliminating them. From a business perspective, examples of these
key metrics are customer satisfaction, RMAs, retries, average time to
complete an order, etc. In addition to long-range systematic process
improvement that is reflected in trends, unusual system behavior as
displayed by SQC charts should be considered.
* Establish a change management procedure. This includes changes in the
data, the metadata, as well as the underlying platforms.
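The control-chart sketch mentioned in the quality-control item above is given below; it is our own Python illustration (the weekly RMA counts and the three-sigma individuals-chart simplification are assumptions, not project data or the project's actual SQC procedure):

```python
from statistics import mean, stdev

def control_limits(baseline_counts, sigmas=3):
    """Center line and control limits computed from an in-control baseline."""
    center = mean(baseline_counts)
    spread = stdev(baseline_counts)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(new_counts, lower, upper):
    """Weeks whose counts fall outside the control limits; candidates for
    root-cause analysis by the process owner."""
    return [(week, count) for week, count in enumerate(new_counts, start=1)
            if count < lower or count > upper]

# Hypothetical weekly RMA counts: a stable baseline, then a pronounced
# variation in week 3 of the new observations.
lower, upper = control_limits([80, 85, 78, 82, 90, 84, 79, 83, 81, 86])
print(out_of_control([82, 88, 140], lower, upper))   # -> [(3, 140)]
```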
Step 6: Management and Staffing Issues
Appropriate management initiatives coupled with
skilled and motivated staff are extremely important
in improving data quality. The following manage-
ment/staffing issues should be considered:
* Assign roles and responsibilities. Distinguish data owner versus quality
``inspector'' roles. Basically, decide who is responsible for what during
the data life cycle (data creation, design, implementation, maintenance,
retirement). Some companies have established ``enterprise data architecture
groups'' that watch over the health of corporate data throughout its life
cycle. Another idea is to treat data management as asset management: the
managers are paid based on the increase in the value of the asset (similar
to other assets in the organization).
* Establish a data quality owner. It is important to identify someone who is
responsible for measuring and improving the quality of data. For example, as
the value of customer and location information increases due to the New
Telecom Bill, someone should own the data quality improvement task to
minimize business impact. The owner may be a person or, in the case of large
projects, a team that is responsible for data quality (this may be part of a
total quality management program). The process owner should be viewed as the
manager of the data asset and should be rewarded or reprimanded as this
asset appreciates or depreciates.
* Develop organizational procedures. In particular, develop a feedback loop
to continuously improve data quality (the notion of asset management). These
organizational procedures may include a variety of items (based on business
drivers) such as determining and isolating backdoors, performing high-level
analysis of root causes, identifying missing information, and routing
feedback by roles (i.e., the next person to contact the customer and the
stewards for data of type x, etc., are first in the feedback loop).
Organizational procedures should include a reward/reprimand system.
* Establish scheduling scenarios. Decide on the events and conditions that
will trigger data movement, the time needed for data movement, and
considerations about time zone differences. This may require some
organizational considerations.
* Include a reward/reprimand system to motivate people. Specifically,
automated flowthroughs between systems (e.g., Telcordia's LOCATION
Flowthrough product) can result in significant changes and, in some cases,
the elimination of several tasks performed by the CLLI coordinators and
other related staff. There has to be a reward system for these people to be
motivated about implementing workflows (perhaps responsibility shifts to
monitoring the process through metrics).
* Do not overlook administrative details. Take care of additional factors
such as the following:
- Staff training;
- Position descriptions (roles and responsibilities);
- Responsibility shifts (impact on organizations);
- Communication/administrative issues (i.e., documentation).
5. Concluding Comments
We have described a case study of data quality in the
telecommunications industry and shown how require-
ments for data quality were collected, how the data
quality metrics were defined, what guidelines were
established for intersystem data flows, what COTS
technologies were used, and what results were obtained
from a prototype. As a result of the experience gained and
lessons learned, we have developed a comprehensive
approach that synthesizes data quality issues into an
architectural vision that is supported by a series of
steps, procedures, checklists, and tools. Our approach
takes into account the technology, process, and people
issues and can be used to design data quality as
an integral part of data architectures. In addition,
we have synthesized the main ideas from the
extant literature to augment our thinking. The
results reported in this paper are a summary
of a series of technical reports, listed in the sidebar
``Data Quality Project Deliverable Reports''.
In the future, we plan to publish further results
from these reports as our experience grows in this
area.
Acknowledgment
We would like to thank Josephine Micallef for her
substantial help in this project, which would not have
been complete without her tireless efforts in imple-
menting our prototype.
References
Ballou D, Pazer H. Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science 1985;31(2):150-162.
Cellco J. Don't Warehouse Dirty Data. Datamation 1995;42-52.
Data Quality: A Critical Information Systems Consideration. Data Management Review 1996.
English L. Data Quality: Meeting Customer Needs. Data Management Review 1996;44-51.
Kaplan D, Krishnan R, Padman R, Peters J. Assessing Data Quality in Accounting Information Systems. CACM 1998;41(2):72-78.
Orr K. Data Quality and Systems Theory. CACM 1998;41(2):66-71.
Redman T. Data Quality. Bantam Books, 1992.
Redman T. Data Quality for the Information Age. Artech House, 1992.
Redman T. The Impact of Poor Data Quality on the Typical Enterprise. CACM 1998;41(2):79-83.
Moriarity T. Better Business Practices. Database Programming and Design 1996;59-61.
Moriarity T. Barriers to Data Quality, Part II: Business Practices. Database Programming and Design 1996;61-63.
Tayi G, Ballou D. Examining Data Quality. CACM 1998;41(2):54-57.
Umar A. Application (Re)Engineering: Building New Web-based Applications and Dealing with Legacies. Prentice Hall, 1997.
Wang RY. A Product Perspective on Total Data Quality Management. CACM 1998;41(2):58-96.
Wand Y, Wang R. Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM 1996;86-95.
Amjad Umar is Director of the Distributed Objects
Engineering and Research Group at Telcordia
Technologies. He is also an Adjunct Professor at
Rutgers University and Stevens Institute of
Technology where he develops and teaches graduate
level courses in distributed systems, object-oriented
systems, networks, and databases. At Telcordia, he
leads projects in distributed object technologies,
advanced middleware, mobile computing, electronic
commerce, legacy data access, data integration, and
data warehouses. His previous experience includes
consulting assignments in the manufacturing industry
and a faculty position at the University of Michigan. He
has authored more than 20 technical papers and three
Prentice Hall books: Distributed Computing and Client-Server Systems, Application (Re)Engineering: Building Web-based Applications and Dealing with Legacies, and Object Oriented Client/Server Internet Environments. He has an M.S. in Computer,
Information and Control Engineering and a Ph.D. in
Information Systems (Industrial and Operations
Engineering Department) from the University of
Michigan. He is a member of ACM and IEEE.
George Karabatis has been a Research Scientist at Telcordia
Technologies since 1995, where he conducts research
on heterogeneous information resources, workflow
systems and distributed databases. He received his
Ph.D. and M.S. in Computer Science from the
University of Houston, and his B.S. in Applied
Mathematics from the Aristotle University in
Greece. His research interests include integration of
heterogeneous systems, data warehouses, system
architectures, workflow systems, quality of data,
consistency maintenance between interdependent
data, and transaction management. He has authored
numerous publications in refereed journals, book
chapters, and conference proceedings in data-
base management, workflows, and distributed
systems. He is a member of IEEE, IEEE/CS and
ACM.
Linda Ness is an Executive Director at Telcordia
Technologies. In this capacity, she oversees tactical
and strategic research projects at Telcordia. Her areas
of interest span data quality, workflow systems,
Internet Telephony, data warehouses, and data
integration. In her previous assignments at Telcordia
Technologies, she has managed research in data
quality and data reconciliation. She also has held
various faculty positions. She holds a Ph.D. in
Mathematics from Harvard University.