Information Systems Frontiers 1:3, 279-301 (1999). © 2000 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Enterprise Data Quality: A Pragmatic Approach
Amjad Umar, George Karabatis, Linda Ness and Bruce Horowitz
Telcordia Technologies, 445 South Street, Morristown, NJ 07960-6439
E-mail: [email protected]
Ahmed Elmagarmid*
Department of Computer Sciences, Purdue University,
West Lafayette, IN 47907
Abstract. Enterprise data, the data that is created, used and shared by a corporation in conducting business, is a critical business resource that must be analyzed, architected and managed with data quality as a guiding principle. This paper presents results, practical insights, and lessons learned from a large scale study conducted in the telecommunications industry that synthesizes data quality issues into an architectural and management approach. We describe the real life case study and show how requirements for data quality were collected, how the data quality metrics were defined, what guidelines were established for intersystem data flows, what COTS (commercial off-the-shelf) technologies were used, and what results were obtained through a prototype effort. As a result of experience gained and lessons learned, we propose a comprehensive data quality approach that combines data quality and data architectures into a single framework with a series of steps, procedures, checklists, and tools. Our approach takes into account the technology, process, and people issues and extends the extant literature on data quality.
Key Words. data quality, data quality metrics, data quality methodology, process flowthrough, distributed systems
1. Introduction
Data quality has emerged as a major issue recently due to its potentially severe impact on organizational effectiveness. For example, a leading computer industry information service indicated that it expects most business process reengineering initiatives to fail due to a lack of data quality (Wand, 1996). In addition, Redman (1998) shows how poor data quality results in operational impacts (e.g., lowered customer satisfaction, increased cost, employee dissatisfaction), tactical impacts (e.g., poorer decision making, more difficulties in building data warehouses, more difficulties in reengineering, increased organizational mistrust), and strategic impacts (difficulties in setting and executing strategies, diversion of management attention). In some cases, in particular financial statements, the data must be certifiably free of certain types of errors (Kaplan, 1998). As other examples, wrong price data in retail databases may cost American industry as much as $2.5 billion in overcharges annually (English, 1996), and dirty data is causing major problems with data warehouses (i.e., many users are retrieving wrong information from their data warehouses) (Cellco, 1995).
Although data quality is gaining some recognition,
views about data quality and approaches to improve
data quality differ widely (English, 1996; Wang,
1998; Orr, 1998; Redman, 1992). The fundamental
problem with the extant concepts, methodologies, and
case studies is that a linkage is not made with the
existing day-to-day activities. For example, state of
the practice in data architectures typically does not
include data quality issues, and thus most data quality considerations are left out of the regular day-to-day corporate practices. There is a need to extend the data architecture work by explicitly including the following two data quality considerations:

* How to measure data quality (i.e., metrics).
* How to improve data quality.
*Participated in project while at Telcordia Technologies.
This paper presents an overall strategy that
synthesizes data quality issues into an architectural
vision that is supported by a series of steps,
procedures, checklists, and tools. This strategy
attempts to bridge multiple areas by first defining key concepts and outlining a framework for discussion (Section 2). In Section 3, we describe an extensive data quality case study that consists of several interrelated activities: (a) problem analysis that involved a series of workshops with project collaborators from participating companies and identification/prioritization of key problems; (b) design for the prevention and amelioration of poor data quality and inconsistency based on data flowthrough and data inconsistency/data quality metrics; (c) assessment of COTS (commercial off-the-shelf) and near-COTS technology that ranged from data cleanup tools to workflows and multidatabase transactions/multisystem integration technologies; and (d) demonstration of the impact of selected technology on real-life large scale data quality problems. Section 4 proposes, based on our experience gained and lessons learned, an extensive
approach that combines data quality and data
architectures into a single conceptual framework. In
particular, our experience has shown that the proposed
strategy is quite effective because it takes into account
the technology, process, and people issues and can be
used to design data quality as an integral part of
enterprise data. The focus of this paper is on
pragmatic issues and practical insights gained through
involvement in large scale data quality problems.
2. Data Quality Concepts and Approaches
There are different views and definitions of data quality (see, for example, English, 1996; Wand, 1996; Moriarity, 1996). We will use the following operational definition of data quality (Redman, 1992):
A product, service, or datum X is of higher quality
than product, service, or datum Y if X meets
customer needs better than Y.
This definition appears to be generally accepted (i.e., quality must be defined in terms of customer satisfaction) and generally conforms to other definitions. For example, Tayi and Ballou's (1998) statement that ``data quality can best be defined as fitness for use'' resonates with this definition. However, it needs further refinement and elaboration. For example, Wand and Wang (1996) identify 26 dimensions of data quality, Wang (1998) defines quality in terms of the information product (i.e., it reflects the quality of the characteristics and processes that manufacture the information product), Orr (1998) contends that the use of data must be increased to improve data quality, and Larry English (1996) suggests that data quality should be considered in terms of data definition, data content, and data presentation.
We focus on the following quality attributes of data, initially proposed by Ballou and Pazer (1985), since they subsume other views:
* Accuracy: reflects correctness with respect to real life.
* Consistency: two or more data items do not conflict with each other.
* Currency: how recent is the information.
* Completeness: degree to which values are present in a data collection.
These attributes are related to the data content and provide the core data quality metrics; a minimal sketch of how they can be computed appears after the next list. Additional important data quality attributes are:

* Definition: reflects that data must be clearly and unambiguously defined.
* Access: shows the ease with which the users can access the data.
* Presentation: reflects the style with which the data is presented.
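To make the content attributes concrete, here is a minimal sketch, assuming two toy location records with invented field names, of how completeness, currency, and cross-system consistency could be scored; it is illustrative only and is not taken from the study.

```python
from datetime import datetime, timezone

# Hypothetical records from two systems that should agree on the same entity.
system_a = {"clli": "MRTNNJ01", "street": "445 South Street", "city": "Morristown",
            "zip": "07960", "updated": datetime(1999, 5, 1, tzinfo=timezone.utc)}
system_b = {"clli": "MRTNNJ01", "street": "445 South St", "city": "Morristown",
            "zip": None, "updated": datetime(1998, 11, 3, tzinfo=timezone.utc)}

def completeness(record):
    """Fraction of fields that are actually populated."""
    values = list(record.values())
    return sum(v not in (None, "") for v in values) / len(values)

def currency_days(record, as_of):
    """Age of the record in days; smaller means more current."""
    return (as_of - record["updated"]).days

def consistency(rec_a, rec_b, fields):
    """Fraction of shared fields whose values agree exactly across the two systems."""
    return sum(rec_a[f] == rec_b[f] for f in fields) / len(fields)

now = datetime(1999, 6, 1, tzinfo=timezone.utc)
print("completeness A:", completeness(system_a))
print("currency A (days):", currency_days(system_a, now))
print("consistency A vs B:", consistency(system_a, system_b, ["clli", "street", "city"]))
```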
Data quality needs to be improved for different reasons:

* Economic reasons (i.e., how can I reduce the cost of service).
* Service quality reasons (i.e., how happy are my users).
* Process efficiency (i.e., maximizing throughput).
* Platform efficiency (i.e., maximizing investment in hardware/software).
To improve the quality of data, several assessment and
improvement activities need to take place throughout
its life cycle. The approaches to obtain data quality
fall into the following broad categories:
* Data cleanup.
* Process cleanup.
Data cleanup involves use of a tool to identify ``bad
data'' (i.e., not accurate/consistent/current/complete)
and then elimination of bad data through automated
and/or manual processes. A wide variety of ``data
scrubbers'' are commercially available to perform this
task. The main limitation of this approach is that some data cannot be easily verified to be correct. For example, no edit can verify someone's home address. In addition, data cleanup needs to be a periodic effort that must be repeated several times over the life cycle of the data.
Process cleanup goes beyond data cleanup and
concentrates on the activities that tend to pollute clean
data. The main activities involved in process cleanup
are:
* Establish quality metrics.
* Monitor the data life cycle for quality pollution.
* Use statistical quality control and process management to maintain desired data quality.
For total quality management, both approaches (data
cleanup and process cleanup) are needed.
Large and complex systems, as shown in Fig. 1,
involve many components (e.g., the data, the
application software that operates on the data, the
underlying platforms on which the data resides and
that are used to access the data, and the overall process
for using and managing the system). Different players
(the users of the system, the managers, the business
customers, the developers, etc.) view these compo-
nents at different levels of detail. Conceptually,
different data quality views can be expressed in
terms of each component to indicate its behavior and
to improve its quality. Examples of the views, from a
data quality perspective, are:
* Quality of the data itself (e.g., accuracy, currency, consistency, completeness).
* Quality of the application software that operates on the data (e.g., software bugs).
* Quality of the platform (e.g., the performance of the access mechanisms).
* Quality of the management and operational processes (e.g., manual interventions, errors, delays, flowthroughs, user satisfaction level).
Fig. 1 shows the main components and the key data quality metrics associated with each component. The model categorizes these components in terms of processes, people and technologies because each of these categories needs to be measured and improved. We also show conceptual inter-relationships between different levels of metrics. We use this model for our view of data quality. This view encompasses several existing views such as English (1996), Moriarity (1996), Orr (1998), Redman (1992), Redman (1998), Wand (1996) and Wang (1998).
3. Case Study: Data Quality in Telecommunications
This section describes a case study of how a family of data quality problems in the telecommunications industry was addressed through a systematic approach. The case study is based on a project that was funded by the Regional Bell Operating Companies (RBOCs) to address the issues of data reconciliation specifically and data quality at large. Fig. 2 presents the project activities, discussed below, as a workflow diagram.
3.1. Problem identification workshops
As a starting point, we conducted a series of one-day
workshops with project collaborators from partici-
pating companies to identify/prioritize key problems.
The workshops were attended by a dozen to three dozen
staff members that included data architects, database
administrators, managers of various customer service
units, and technology consultants. Each workshop
started with an overview of the data quality issues to
introduce key concepts and to provide a common
framework for discussion. The overview was fol-
lowed by an open brainstorming session on the data
quality problems encountered by the sponsors, a
prioritization/classification session to focus on the key
problems, and a detailed discussion session to quickly
analyze the high priority items.
In all, more than 80 issues were raised by the
sponsors at these workshops. In order to obtain a
meaningful view of the issues, they were categorized
into 15 broad areas that were ranked by the customers
in terms of business need (see Table 1). The overall priority ranking of a category was proportional to the number of issues in that category, summed over all
sessions. For example, data quality metrics were of
Fig. 1. Conceptual model for metrics.
Fig. 2. Data quality (DQ) project plan.
Table 1. Analysis of data quality problems
Broad issues Priority Category of data quality problem
*Inconsistency among systems H Data
*Process improvement and mechanization H Process
*Need for metrics H Process
*Flow through between systems H Software
*Root causes and data redundancy H Data
*System architecture and system evolution M Software
Standardization M Process
Inconsistency with reality M Data
Data input/validation M Data
Data access and security M Software
Single primary source of data M Data
Communications/administrative complexity L Process
Ownership and accountability L Process
Methodology L Process
Recovery from data inconsistencies L Software
major concern to multiple participants in all work-
shops we conducted and were thus ranked very high.
After the workshops, we also attempted to cast the
15 broad issues into various data quality problem
categories such as data itself, the application software
that operates on the data, the underlying platforms on
which the data resides and that are used to access the
data, and the overall process for using and managing
the system (see Table 1, column 3). During this
categorization, we noted that many high priority
problems were not related to the quality of the data
itself but were instead concerned with the process and
platform issues. This was a surprise to us. In addition,
many data quality problems discussed in these workshops were directly or indirectly related to location data (i.e., location of customers, telecom equipment, and work centers). This is because location information plays a central role in telecom systems.
Let us now briefly review the broad categories of issues that were raised in these workshops.
Inconsistency among systems. Inconsistent infor-
mation between different systems causes serious
problems. The same information does not exist in all
required systems, and even when it does, it is
represented differently in different systems. A better
way is needed to keep replicated data synchronized.
The corporate data solution of some clients was examined, and it was found that different kinds of addresses from different systems represent the same entity. This causes business problems (in some cases 25% of the customers do not receive their bill due to a bad address). There is also a low degree of automation in the process. This, coupled with a high
degree of repetitive data entry to multiple systems,
exposes the process to ubiquitous data reconciliation
problems. In particular, it is likely that data entry
errors will occur, causing data discrepancies between
the systems. It is also likely that some or all updates
will not be performed to a particular system, resulting
in a system containing incomplete data. The following example illustrates this situation. When a central office (CO) is sold, if that information is not entered promptly, then an automated inventory system will send equipment to the new owner but the old client pays. In addition, the old client pays tax on the equipment, as it is still carried in their database. A single interface to all systems is needed to access location information.
Process improvement and mechanization. The
processes need to be improved and automated for
lasting effects. The common scenario is: a manager
comes in, fixes the databases, gets promoted, leaves,
and the problems resurface. The issue is really one of
integrating disparate processes. The customer data is
critical because it is important to quantify return on
any customer projects. The main issues discussed in
this area were:
* Our clients need to manage issues between regulated and unregulated sides of the market.
* There is a need to mine information in customer databases that will allow our client to offer new products to new customer segments (and to identify those segments).
* Some clients need a unified customer number. There is a need to know if someone is a customer and how to identify them. Name and address present several problems. Some databases use telephone numbers, while others use circuit IDs. There is a need to be able to find all telephone numbers and circuit IDs that are associated with a customer.
Need for metrics. Metrics are needed to drive
behavior to justify investment and detect self-funding
opportunities. Monitoring of metrics is an important
issue in data reconciliation. For example, one should
be able to estimate the financial impact due to data
reconciliation. In particular, the following issues were
raised:
* How to measure the impact of data quality problems?
* How to measure improvement in data quality?
Metrics must include data reconciliation cost
savings (e.g., labor costs, labor/error resolution,
overtime labor/retry error resolution), performance
(e.g., volume work orders, errors, backlog, average
time/error), and operations (volume of orders per
center, volume of errors per center, volume of backlog
per center, average time of handling an error vs.
flowthrough).
Flowthrough between systems. Current flow design is a cause of data inconsistencies. One suggested
solution was reversion to batch processing. Some
clients noted that batch processing is inconsistent with
their need to quickly propagate customer data.
Root causes and data redundancy. Software evolution can cause data quality to decrease, since there are no procedures for making old data consistent with new integrity rules. One root cause of data inconsistencies among provisioning and operations databases is the fact that the processes were first developed as isolated islands and were gradually pieced together. Some systems have many duplicated databases. This is a cause of significant data inconsistency.
System architecture and system evolution. Re-
dundant data architecture is a root cause of data
inconsistencies. A well designed architecture is
needed to eliminate data silos (i.e., ``smokestack
systems''). Many applications and projects do
local optimization, subverting global optimization.
Some of the reasons are:
* Unreliable synchronization technology.
* Time pressures (get things up and running quickly).
* Lack of cross-application responsibility, data and applications knowledge, and overall data architecture.

There is an economic justification for a better architecture. Databases proliferate because people can't obtain access to needed data; they create a database and the project manager controls it. Architectures are very closely related to flowthroughs and flow design.
Standardization. Uniform valid point of entry is
needed. For example, some systems accept non-
standard format. In addition, a common format is
essential. In some cases, address information is stored in more than 20 different data files; the addresses are of various types (e.g., postal, client specific, etc.) and obey various rules. A uniform valid point of entry with a single owner is needed.
Inconsistency with reality. Many information items
are inconsistent with reality. For example, it is difficult to figure out which equipment belongs to which user.
Data input/validation. There is a need to clean
incoming data. This includes single enforced input
interface and management of downstream changes.
Data input validation implies enforcement of standards. Thus, most of the standardization issues apply here.
Data access and security. A standard is needed for
data extracts. There is a need to access legacy systems and a number of other databases to obtain
information. People don't know enough details about
these systems, and when they ask, they are faced with
constant problems.
Single primary source of data items. A single
repository for data items is needed. In particular, this
represents the need for reconciliation of metadata, by
having a single system act as a steward. A single
repository would provide the definitive source of
metadata information for all the corporate data in the
various systems.
Communications/administrative complexity. Soft-
ware releases and upgrades can impact other systems.
There is a need for a repository for proposed
changes that must be adequately documented.
Ownership and accountability. Reference data
should not reside in individual applications.
Owner(s) should be identi®ed and propagated as
needed from a single source. Data ownership is
needed for data elements.
Methodology. A methodology is needed for con-
tinuous quality improvement. Some of the items mentioned were: how to integrate data quality metrics and processes, the need for a forum to discuss data quality issues, the need to coordinate existing efforts, and the need to integrate data quality into applications and improve processes. During the implementation of processes
processes. During the implementation of process
among various systems, one client needs to do a
mapping between the databases and the data models.
They require support for technical and implementa-
tion issues and performance, and they need to quantify
the economics of the applications. A methodology is
needed for continuous quality improvement.
Recovery from data inconsistencies caused by software bugs. Software bugs cause data problems that
sometimes never get cleaned up. No procedures for
data cleanup (i.e., recovery) are included with
software patches.
3.2. Data quality metrics
After the problem identification, we decided to focus quickly on data quality metrics since this area was rated the highest in the workshops.
As shown in Fig. 1, and discussed previously,
software systems involve many components (e.g., the
data, the application software that operates on the
data, the underlying platforms on which the data
resides, and the overall process for using and
managing the system). Conceptually, a large number
of metrics can be kept for each component to measure
its behavior and to monitor its improvement.
Examples of the metrics, from a data quality point
of view, are:
* Metrics of the data itself (e.g., correctness, currency, consistency, completeness).
* Metrics of the application software that operates on the data (e.g., software bugs).
* Metrics of the platform (e.g., the performance of the access mechanisms).
* Metrics of the management and operational processes (e.g., RMAs, errors, delays, flowthroughs, user satisfaction level).
Candidate metrics for data quality

a. Customer Related Metrics
* Customer satisfaction (user view)

b. Data Metrics
* Data accuracy (e.g., wrong V&H coordinates, wrong pairs available)
* Data currency
* Data completeness
* Data consistency (same field in different systems)
* Which data items are the dirtiest?
* Data usage (which data items get used most often by whom)

c. Platform Metrics
* Storage occupied
* Data access: ease of use
* Data access: response time
* Platform reliability

d. Application Software Metrics
* Changes in data structures (e.g., field lengths)
* Number of changes made to software due to changes in the real world (e.g., changes in copper facilities)
* Application software bugs

e. Process Metrics
* Frequency of manual input (normal process, not requests for manual assistance (RMAs))
* Errors between systems (i.e., OK for a system but an error when passed along to another one)
* Money (business) lost due to wrong quotes
* Percentage of held orders that can be attributed to bad inventory data
* Field technician time wasted due to bad data (addresses, inventory)
* Accuracy of property records for tax purposes
* Accuracy of location data for tax/tariff purposes
* Number of telephone calls (RMAs) produced due to incomplete data supplied by field engineers
* Number of calls received from customers due to missing information in the forms
* Number of sources of information looked at before a customer question can be answered
* Number of times a question is generated due to bad data
* Number of times (percentage) the RMAs were not handled correctly (i.e., second order RMAs)
* Retries due to staff errors
* Successful flowthroughs and things done in less time than expected
* Fallouts in end-to-end flowthroughs for provisioning
* Critical dates met versus missed
* Labor/error resolution
* Overtime labor/retry error resolution
* Volume of service orders by circuit id
* Errors in dealing with service orders
* Backlogs
* Average time/error
* Volume of orders per center
* Volume of errors per center
* Volume of backlog per center
Based on the feedback from the Requirement
Solicitation Workshops, response to the survey
questionnaire, and general literature surveys, we
developed a set of key metrics for these components.
The resulting metrics are listed in the sidebar ``Candidate Metrics for Data Quality''. Naturally, not all of these 40 metrics are needed; you need to reduce this list based on metric priority, method of measurement, frequency of measurement, cost of measurement, risk of ignoring, and other appropriate factors. Table 2 shows how you can select data quality metrics. The main idea is to focus on those metrics that have the highest business impact. The best candidates for selection are those metrics that have high priority, low cost of measurement and high risk of ignoring. We illustrate the use of this table with a few metrics. We used a similar table to evaluate and select the most appropriate metrics out of the 40 we identified.
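As an illustration of the kind of screening Table 2 supports, the sketch below scores a few candidate metrics on priority, cost of measurement, and risk of ignoring, and keeps the high-impact, low-cost ones. The numeric weights and the threshold are our own illustrative assumptions, not a rule from the project.

```python
# Candidate metrics with (priority, cost of measurement, risk of ignoring), each L/M/H.
CANDIDATES = {
    "Customer dissatisfaction": ("H", "H", "H"),
    "Disk storage overruns":    ("L", "L", "L"),
    "RMAs":                     ("H", "M", "H"),
    "Retries":                  ("H", "M", "H"),
}

SCORE = {"L": 1, "M": 2, "H": 3}

def select(candidates):
    """Keep metrics whose business impact (priority + risk) outweighs measurement cost."""
    selected = []
    for name, (priority, cost, risk) in candidates.items():
        impact = SCORE[priority] + SCORE[risk]
        if impact - SCORE[cost] >= 2:      # illustrative threshold
            selected.append(name)
    return selected

print(select(CANDIDATES))   # matches Table 2: everything except disk storage overruns is kept
```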
After a metric process has been established and
implemented, a quality control process must be
instituted to monitor, evaluate and control the process
based on the metrics. The focus is on detection and
elimination of special causes that appear in the
process. In particular, the process performance is
predicted/measured against:
* Process goals (e.g., manual interventions must not exceed 100 per week).
* General trends (e.g., gradual improvement of data quality).
* Variation control (e.g., learn root causes for pronounced variations).
Numerous examples of quality control in manu-
facturing can be found. For example, many
manufacturing organizations target 100 defects per
million parts (Motorola has targeted 3 defects per
million). The main challenge is to monitor and manage
variations against targeted goals for improvement.
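To show what monitoring a process goal might look like, the following minimal sketch uses made-up weekly counts of manual interventions, checks each week against the stated goal, and flags pronounced variation with a simple two-sigma rule; it is an illustration of the idea rather than the project's actual control process.

```python
from statistics import mean, stdev

# Hypothetical weekly counts of manual interventions; the process goal is <= 100 per week.
weekly_interventions = [82, 91, 78, 104, 88, 95, 143, 86]
GOAL = 100

mu, sigma = mean(weekly_interventions), stdev(weekly_interventions)
upper_limit = mu + 2 * sigma   # simple two-sigma limit for detecting pronounced variation

for week, count in enumerate(weekly_interventions, start=1):
    notes = []
    if count > GOAL:
        notes.append("misses goal")
    if count > upper_limit:
        notes.append("pronounced variation: investigate root cause")
    print(f"week {week}: {count} manual interventions", "; ".join(notes))
```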
3.3. Intersystem data flow
Intersystem data flow in large scale telecom systems
was ranked as one of the most important problems in
the workshops. In a data flow process, data is
generally received by a system (either from one or
more other systems, or directly, as input data),
processed (and perhaps altered) by that system, and
then passed to one or more other systems. The
processing system may even create new data. Data may be committed to permanent storage (perhaps after transformation) within the processing system and/or may be sent to other systems.
Although discussed frequently in the literature (Umar, 1998), it is generally not the case that data in
multiple systems are accessed or modi®ed concur-
rently. In particular, read and/or write operations
under the same transaction umbrella that involve
multiple systems are seldom required in telecommu-
nications systems. However, sometimes a system may
need to request some reference data from one or more
external systems.
Typically, the same data is passed through and
processed by many systems, thus the data must be
viewed in a context larger than that of any individual
system. Such a view should include both representa-
tion (syntax) and interpretation (semantics) of
data.
Table 2. Quality metrics selection example

Quality metric (examples) | Metric priority (L, M, H) in terms of net cost | Method (tool) for measurement | Suggested frequency of measurement | Cost of measurement (L, M, H) | Risk of ignoring (L, M, H) | Selection decision (Yes/No)
Customer dissatisfaction | H | Customer surveys | Once a month (or at the end of each assignment) | H | H | Yes
Disk storage overruns | L | System utility | Daily | L | L | No
RMAs (requests for manual assistance) | H | PAWS report | Daily | M | H | Yes
Retries | H | PAWS report | Daily | M | H | Yes
Data flow is a complex problem, involving many issues that must be considered in order to design a flow that will be successful and that will minimize data reconciliation problems. It is not sufficient to consider data flow in just a local context (i.e., with respect to the systems that a given system either directly receives data from or directly sends data to). Rather, it is critical that data flow be considered in the context of the enterprise and its business goals and processes. In particular, it is important that such a global context be one in which unanticipated data flows can be accommodated, as systems evolve and new systems are created.
It is also important that we be able to accommodate
legacy systems in the data ¯ow, even though they
might not be architected in a way that would be
desirable if we were designing them ``from scratch''.
We developed a set of recommendations for
effective data ¯ow. A few of these recommendations
are listed in the sidebar ``Sample Data Flow
Recommendations''.
3.4. Assessment of commercial technologies for data quality
A diverse array of technologies, available from a
variety of vendors, can be used to address many of the
data reconciliation problems identi®ed during the
Workshops. These technologies come with varying
price/performance ratios, offer a wide range of
features, and are available on different computing
platforms. Examples of the technologies of particular
interest to us are data scrubbers, data warehouse tools, replication servers, workflow systems and multidatabase systems.
Table 3 summarizes the results of analyzing the
potential impact of the various COTS technologies to
address the issues that were raised during the
requirements workshops. The potential impact is
quantified as high, medium and low to indicate the
possible effectiveness of COTS technologies to
address the issues raised during the workshops. As
can be seen, there is no single solution to the entire problem; however, a well balanced combination of the aforementioned technologies can alleviate the problem considerably.
Sample data flow recommendations

Major Global Organization Recommendations
* Distinct systems should have distinct purposes, should not cause the same kinds of changes to the same data items, and should not cause the same kinds of physical phenomena to occur to the same real world objects.
* A system that requests modifications to particular data should have a high proximity to the system that stewards that data.
* A system that has a need for highly accurate data should have high proximity to the system that stewards that data.
Table 3. Relative impact of solution technologies (L, M, H)
Issues Data warehouses Replication servers Workflow systems Data scrubbers Multidatabase systems
*Location problems M H M
*Inconsistency among systems M H H L L
*Process improvement and mechanization L M H
*Metrics
*Flow through between systems L L H
*Root causes and data redundancy H M
*System architecture and system evolution M M M
Standardization M H L
Inconsistency with reality M L
Data input validation L L
Data access and security L H
Single primary source of data H
Communications/administrative complexity
Ownership and accountability
Methodology H
Recovery from data inconsistencies M H
* Data should flow from a system that requires higher precision to one that requires lower precision.
* Data should flow from a more constrained system to a less constrained system for that data.
* Data should originate with the system that is the steward of the data, and should flow to systems that need to use the data.
* A tree structure is the preferred organization for data flow (no loops).
* Race conditions should be designed out of a data flow process. This includes scenarios where the same data is sent to a system from multiple sources, as well as the scenario where different, but dependent data are sent to a system from multiple sources.
* A mechanism (such as sequence numbers) should be provided with data transmissions, and/or a uniform format should be provided so that systems can determine an appropriate order in which to process the data.
Major Metadata Recommendations
* All systems that contain the same data items should follow identical standards for the characteristics of those data items, as well as for integrity constraints on those data items.
* The same data should be converted by multiple systems in the same manner.
* Consistency of integrity constraints, business rules, and data models needs to be considered in a global context.
Major Data Validation Recommendations
* Any particular data item should be manually entered into a single system.
* Data should be entered into the system that is the steward of that data.
* If data is entered into a target system that is not the steward, that target system should essentially act as an interface system to the steward, and forward the data to the steward for entry into the stewarding system.
* Data should be validated within each system that the data resides in, as part of a flowthrough process.
* Data may need to be validated, relative to a business rule, as a set, and not just as individual facts.
* Data needs to be either rejected or marked as being invalid as soon as it is detected.
* A feedback loop to the steward would help correct data that a downstream system considers invalid. The steward should also be sent an indication of why the data is invalid.
* Data that is determined to be invalid should not be passed on to other systems downstream.
Major Data Flow Enablement Recommendations
* Data that is highly dynamic should not be updated by multiple systems.
* Data that needs to be validated together should be sent as a unit.
Major Intersystem Access Recommendations
* Distributed Transaction Processing is not currently required as an access paradigm for data flow, but it may be appropriate to consider it at some point in the future.
* Data should never be simply sent to another system under the assumption that it will be received; some form of guaranteed delivery or acknowledgment should be used.
* Appropriate synchronization mechanisms should exist to ensure either that data is received in a correct order for processing or that the receiving system can determine the correct order to process incoming data (a small sketch after these recommendation lists illustrates this idea).
Major Data Replication Recommendations
* Data replication should be minimized, and should only be done for very good reasons.
* If data is replicated in more than one place, then the most accurate source of the data is the system that stewards the primary enterprise copy of that data.
* Data that is highly dynamic is a poor candidate for replication.
* Data replication may involve more than simply producing a new version of a database.
Major Procedures and Design Process Recommendations
* Automation of data flow should be implemented where it is economically feasible to do so.
* Evolution should never occur without the knowledge and participation of representatives from other systems that could potentially be affected.
* There needs to be a carefully managed schedule for
phasing in new systems or new versions of systems.
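The intersystem access and flow enablement recommendations (guaranteed delivery with acknowledgment, sequence numbers, processing in a correct order) can be illustrated with a small sketch of a receiving system that buffers out-of-order updates and applies them strictly in sequence; the class and message format are invented for illustration.

```python
class OrderedReceiver:
    """Buffers incoming updates and applies them strictly in sequence-number order."""

    def __init__(self):
        self.next_expected = 1
        self.pending = {}     # sequence number -> payload, held until its turn
        self.applied = []

    def receive(self, seq, payload):
        self.pending[seq] = payload
        # Apply every consecutive message now available, acknowledging each one.
        while self.next_expected in self.pending:
            self.applied.append(self.pending.pop(self.next_expected))
            self.acknowledge(self.next_expected)
            self.next_expected += 1

    def acknowledge(self, seq):
        # Stand-in for the guaranteed-delivery acknowledgment sent back to the sender.
        print(f"ack {seq}")

rx = OrderedReceiver()
for seq, update in [(2, "change street type"), (1, "create location"), (3, "set ZIP code")]:
    rx.receive(seq, update)
print(rx.applied)   # ['create location', 'change street type', 'set ZIP code']
```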
Data Warehouse (DW) tools can be utilized in a variety of ways within the framework of data quality. Although DWs are voluminous and cannot replace day-to-day operational systems due to performance overhead, they can be used as a point of reference for other systems. DWs help with the ``inconsistency among systems'' problem. DWs can also provide some standardization that other systems can refer to, since they usually contain information from several other operational systems. In addition, DWs can help to a limited extent in process improvement and mechanization because they contain a lot of informational context.
Replication servers can alleviate the inconsistency
among systems in cases where there is replicated
information stored in various systems. They can help
in process improvement due to their automatic way of
propagating information. They are also designed to
support a scheme with a primary source of data and
secondary replicas. Flowthroughs are improved in
the presence of a replication server, and they
provide an excellent tool to maintain consistent
copies of replicated data. Some replication servers support conditional replication, which is extremely useful in a telecommunication company. Specifically, one can decide how often the replicated data will be updated and under what specific conditions.
Workflow systems mainly help in designing flows among various systems. They contribute to Process Improvement and Automation because they provide tools to change flows and re-direct them among various systems easily, with no significant programming effort. Updating flowthroughs can become more effective with the presence of a workflow system. Workflow systems can support system evolution specifically due to their flexibility of flow re-design.
Data scrubbers provide useful tools for data entry validation and intrasystem cleanup, one system at a time. This validation capability provides a means of standardizing interfaces and helps minimize inconsistency among systems.
Multidatabase technology can be used in a variety of facets of data reconciliation. In particular, it is useful as ``middleware'' that provides connectivity between systems. It can help with access to, and authorization for, systems.
Although the extant COTS technologies exhibit a great deal of potential to address many of the issues raised during the workshops, the full benefits may be somewhat limited due to the following limitations:

1. In general, data reconciliation requires a mixture of COTS. Many COTS do not work with each other (e.g., workflow systems do not work with replication servers). Integration of COTS is an area of future work.
2. Many existing tools do not work on legacy data sources such as IMS, VSAM, etc. However, many RBOC data reconciliation problems require legacy data reconciliation.
3. Workflows with added capabilities for distributed transaction processing are needed to deal with the issues of long (Saga) transactions.
4. Web interfaces to existing tools are just beginning to emerge (e.g., Web access to workflows, Web access to data warehouses). This area needs to mature somewhat to be effective.
5. A potential area of future work is development of an intelligent, Web-based interface that invokes different technologies (e.g., data scrubbers, replication servers, workflows, multidatabase middleware, data warehouse tools) based on a problem type.
3.5. Implementation: the pilot project
To demonstrate how some of the technologies can
work together to solve large scale data quality
problems, we created a pilot project. Its purpose was
to design and implement a prototype to improve the
quality of data being used in operational systems of
telecommunication applications. The pilot project
revolved around two existing Telcordia operational systems: CLONES and LOC.
CLONES is a system responsible for creating
unique codes to name and identify various entities
such as equipment, building locations, etc. The entire
system is built on top of a relational database
(INFORMIX in Unix environments) which manages
these unique codes on behalf of Telcordia clients.
LOC is another Bellcore system comprising applications built on top of a relational database (DB2
on MVS) that stores and maintains location informa-
tion. In particular, LOC contains geographical
location information about planning and forecasting
areas, telecommunication equipment inventory, vendors, customers, etc. LOC also stores relationships
among these locations.
When a new entity is given a location code in
CLONES, it must also be inserted into LOC, so there is
an information flow from CLONES to LOC regarding locations. However, CLONES is a separate application that does not have detailed location information about a new entity. This information is added in LOC but often does not exist in CLONES. Since CLONES and LOC are independently updated, the databases can become inconsistent. The consequence of this type of low data quality becomes evident when some application gets invalid information from either CLONES or LOC and continues based on erroneous or invalid information. In this section we outline the replication and data scrubbing technologies we used to improve the data quality in this environment.
Setting up the environment. Since we did not want
to interfere with the daily operations of the actual
CLONES and LOC while they were in operation, we decided to get copies of these databases and set up our pilot system on these copies. This allowed us more flexibility with experiments such as schema changes,
workarounds, etc. on copies of the actual data stored in
our local PC databases. Since CLONES is imple-
mented on an INFORMIX database running on UNIX,
and LOC is on a DB2 running on MVS, we had to get
the CLONES and LOC schemas as well as data
extracts out of the original databases. We also chose to
use a commercial replicator between CLONES and
LOC. For the purposes of the pilot project, Microsoft SQL Server 6.5 proved to be adequate to carry out the
experiment since it comes with a data replicator that
can send data to another MS SQL Server, or to another
database that has 32 bit ODBC drivers compatible
with those of MS SQL Server (for example ORACLE).
Therefore we created extracts from CLONES and LOC
which we inserted into two separate SQL Server
databases running on different PCs.
* CLONES: Data for several tables were extracted as ASCII delimited files and were sent to our environment via FTP. Then they were entered into the SQL Server in a schema that was identical to the INFORMIX one, using a ``bulk copy'' utility of SQL Server. However, dates were creating errors and we had to enter the date columns as strings. Translation back to dates could be done later on.
* LOC: Data was extracted from the LOC database, which is a DB2 database on MVS. However, several tables in LOC contained columns that represented data that was not present in CLONES.
Fig. 3 depicts the overall architecture of the pilot
project. We addressed data quality in three major
categories: Input validation, data scrubbing and data
replication. We added an input validator on top of
CLONES and LOC in order to clean new data entries.
We also addressed ``dirty'' data already existing in the
databases through a data scrubber. Once the data in
CLONES had better quality it was replicated into LOC
automatically. However, LOC adds more columns to the data and has more detailed information about locations. To avoid overwriting such additional information through automatic replication from CLONES, we also sent such data back to CLONES to update its records with more recent and correct information. In the following we describe in more detail each tool that we used to improve the data quality of those systems.
Input validation, standardization and scrubbing.
One of the major sources of ``dirty'' data in the databases is human error, especially when data is being entered into the system. Our philosophy was to ``trap'' such occurrences as early as possible. For this reason, we decided to filter and standardize name and address information after it is entered by the user, and before it is stored in the database, thus providing input validation. Therefore, we prototyped one of the CLONES data entry screens: the screen used to insert a
building code into CLONES. The user enters the
building name and address in free format. The name
and address are then passed to a software scrubbing tool
for decomposition into their constituent components,
and standardization into a predetermined format.
The use of standardization tools ensures that address information in the database is entered in a predefined way and minimizes human input errors. For example, ``Main St.'' and ``Main Street'' represent the same entity; however, from the database viewpoint they could create two separate rows. In cases where keys were automatically created (such as surrogate keys), the situation could easily lead to
unwanted cases of having two distinct keys repre-
senting the same entity.
We used a commercial tool for data standardization from Innovative Systems, Inc. The ISI Edit tool standardizes and decomposes names and addresses into components that in turn are used in SQL statements to query or update the database. The ISI Edit routine classifies an input line into one of 12 line types. For our prototype, the line types of interest were:
``O'': Organizational (used for building name, e.g., ``Premier Widget Corp'').
``S'': Street (the address line containing a street name, e.g., ``445 South Street'').
Fig. 3. Final pilot project architecture.
``C'': City (the address line containing the city, e.g., ``Morristown, New Jersey 07960'').
For each input line, ISI Edit returns a standardized line. ISI Edit standardizes organizational suffixes (which may appear in the building name), street types, directional prefixes, directional suffixes, subaddress identifiers, rural identifiers, box identifiers, building identifiers, states and provinces.
For certain line types (including ``S'' and ``C''), the system also returns fixed-field data elements and codes. For example, the street line (i.e., type ``S'') is decomposed into street address number, street name and street type, street prefix, and street suffix. Since the street name and street type are separate fields in the CLONES table, we had to split the output returned by ISI Edit ourselves. Table 4 shows a few sample records that were standardized using ISI Edit.
Although our attempt to provide input validation was successful, we still had to clean the data already residing in the database. We utilized the standardization capabilities of the ISI Edit tool and created an application that scans the entire table in the database and standardizes each record. If the original and the standardized record were not identical, then we needed to take action. At this point there were two avenues to take: either proceed in batch mode and replace the
original record with the standardized one, or present
the result to a user who chooses whether to update the
record or not. We chose the latter approach to visually
examine the differences between the original and the
standardized record.
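To make the scrub-and-review loop concrete, here is a minimal sketch that uses a toy standardization function (a stand-in for the commercial ISI Edit tool, not its real interface) and a hypothetical list of address rows; it scans every record, standardizes it, and queues the records that changed for a reviewer's decision.

```python
# Toy stand-in for a commercial standardizer such as ISI Edit (not its real API).
ABBREVIATIONS = {"STREET": "ST", "ROAD": "RD", "LANE": "LN", "NORTH": "N", "CORPORATION": "CORP"}

def standardize(line):
    """Upper-case the line and apply a fixed abbreviation table to each word."""
    words = line.upper().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# Hypothetical address rows already in the database.
address_table = [
    "1033 North Tower Hill Road",
    "445 South Street",
    "1033 WINDING LN N",          # already standardized; no action needed
]

review_queue = []
for original in address_table:
    cleaned = standardize(original)
    if cleaned != original:
        # Present the pair to a user who decides whether to update the record.
        review_queue.append((original, cleaned))

for original, cleaned in review_queue:
    print(f"{original!r} -> {cleaned!r}")
```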
Schema manipulations for quality metrics. To be
able to assess the quality of the data in a table, we
performed some simple additions to the schema, to
ease the creation of metrics for quality:
* Addition of an expiration date column: Some addresses in CLONES may have a temporary nature. For example, when a new subdivision is being built and no street names have been assigned yet, the data that is entered into CLONES is temporary (e.g., BLOCK 5, LOT 3) since no better description of the location exists when a location code is created. In such cases, the person entering data into CLONES can specify an expiration date for this particular CLLI code. Later on, when a real street name and number have been given to that location, it is very easy to query for expired addresses and to correct them. In addition, one can find out very easily how many addresses have expired, thus getting some metrics on the quality of the data.
* Addition of a quality column: When new location information is entered into CLONES, the person entering it knows how precise the information is compared with other addresses already in the database. A new column was therefore added to the schema of the address table in CLONES which identifies the quality of the information entered. Then it is easy to assess the data quality of the table by issuing a straightforward SQL statement, as sketched below.
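A minimal sketch of the kind of straightforward SQL this enables is shown below; it uses SQLite and invented table and column names (expiration_date, quality) rather than the actual CLONES schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        clli_code       TEXT PRIMARY KEY,
        street          TEXT,
        expiration_date TEXT,   -- NULL when the address is not temporary
        quality         TEXT    -- e.g., 'precise', 'approximate', 'temporary'
    );
    INSERT INTO address VALUES
        ('MRTNNJ01', '445 SOUTH ST', NULL, 'precise'),
        ('SPRTNJ02', 'BLOCK 5, LOT 3', '1999-01-01', 'temporary'),
        ('MORRNY03', '1033 N TOWER HILL RD', NULL, 'approximate');
""")

# Expired (and therefore suspect) addresses that should be revisited.
expired = conn.execute(
    "SELECT clli_code, street FROM address "
    "WHERE expiration_date IS NOT NULL AND expiration_date < '1999-06-01'").fetchall()

# Simple quality profile of the whole table.
profile = conn.execute(
    "SELECT quality, COUNT(*) FROM address GROUP BY quality").fetchall()

print("expired:", expired)
print("profile:", profile)
```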
Data replication. We used Microsoft SQL Server
6.5 Replicator to experiment with data that are entered
in CLONES and then are sent to LOC. In our prototype
implementation we concentrated on data that populate
the SITE table in LOC. This table contains informa-
tion originating from four tables in CLONES.
In general, the convention for replicating from one
database to another is via a one-to-one mapping. SQL
Server provides an easy way to replicate on a one-to-
many scenario where the primary table is replicated to
many secondary sites. However, in our particular
example we faced a completely different situation:
Several tables in CLONES are mapped into a single
table in LOC. That is, we have a many-to-one scenario
to deal with. MS SQL Server provides many options regarding what to replicate (selected columns, rows, or a combination of both) and how often to replicate (instantly, periodically, every x number of transactions on the primary copy, etc.). However, it is
not straightforward to combine a set of tables for
replication into a single one. One way would be to
create a view on CLONES and replicate the view;
however, only base tables can be replicated, not
views. The approach we took was to replicate each
table in CLONES to LOC and reconstruct the SITE
Table 4. Illustration of data cleanup
Input data Output data
Fleet National Association FLEET NATIONAL ASSOC
1033 North Tower Hill Road 1033 N TOWER HILL RD
Morris, New York 10011 MORRIS, NY 10011
Hasbro Corporation HASBRO CORP
1033 Winding Lane North 1033 WINDING LN N
Sparta, NJ 07999 SPARTA, NJ 07999
table in LOC. This approach worked fine, and replication proved to work in a very satisfactory way. During the initial synchronization phase, all data (about 30 MB) were sent to LOC without any problems. Subsequent updates on CLONES are immediately propagated into LOC with no time delay. In short, the replication technology proved to be mature enough for our purposes.
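The many-to-one mapping described above, replicating each source table and then rebuilding the target table on the receiving side, can be sketched as follows. The table and column names are invented for illustration and do not reflect the real CLONES or LOC schemas.

```python
import sqlite3

loc = sqlite3.connect(":memory:")   # stands in for the LOC-side database
loc.executescript("""
    -- Replicated copies of (simplified, invented) CLONES tables.
    CREATE TABLE clones_location (clli_code TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE clones_address  (clli_code TEXT PRIMARY KEY, street TEXT);
    INSERT INTO clones_location VALUES ('MRTNNJ01', 'MORRISTOWN');
    INSERT INTO clones_address  VALUES ('MRTNNJ01', '445 SOUTH ST');

    -- Target table whose rows combine several source tables.
    CREATE TABLE site (clli_code TEXT PRIMARY KEY, city TEXT, street TEXT);
""")

def rebuild_site(conn):
    """Reconstruct SITE from the replicated source tables (a view cannot be replicated)."""
    conn.execute("DELETE FROM site")
    conn.execute("""
        INSERT INTO site (clli_code, city, street)
        SELECT l.clli_code, l.city, a.street
        FROM clones_location AS l JOIN clones_address AS a USING (clli_code)
    """)
    conn.commit()

rebuild_site(loc)
print(loc.execute("SELECT * FROM site").fetchall())
```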
The following are the main findings from the pilot project effort:

* Front-end data validation using ISI Edit worked well.
* Data was successfully entered into the ``CLONES'' database.
* Data was successfully replicated to the ``LOC'' database via MS SQL Server Replicator.
4. Lessons Learned and Suggested Approach
Our interactions with a large population of customers
and working with numerous tools and methodologies
have taught us the following lessons:
* Data quality should not be discussed in a vacuum. Many other ``non-data'' issues such as platforms, processes, and software architectures must be taken into account while considering data quality. This observation has been articulated in the literature (Orr, 1998).
* In discussions with end users and customers about data quality, the data quality issues get intermixed with other issues such as software quality and quality of service.
* Despite increased attention, enterprise data quality is not a well established and recognized corporate activity with concomitant corporate responsibility. Without such responsibility, it is difficult to improve the quality of enterprise data.
* In terms of priority, we have found that more problems were related to process than to data. For example, many high priority problems were not related to the quality of the data itself but were instead concerned with process and platform issues.
* Different organizations naturally have different data that is of primary concern for data quality. In the telecom industry, it happens to be location data that shows the location of customers, equipment, and work centers.
* Our observations contradict Ken Orr's assertion ``Data quality: use it (the data) or lose it'' (Orr, 1998). We have found that critical data is used frequently but is less apt to be of higher quality because people tend to create more copies of critical data and thus create data quality problems.
* For continuous improvement of data quality, it is essential that data quality procedures be meshed with existing corporate practices (i.e., data quality should not be a separate activity that may or may not be conducted). We suggest that integration of data quality with enterprise data architectures is an appropriate approach.
* Issues are not all technical. We found several issues that are purely administrative. In particular, an owner of data quality must be appointed to handle the administrative issues.
* Different views on data quality exist and expectations vary widely among users. In addition, many hidden costs (e.g., personnel training) are encountered and should be taken into account.
* The COTS technologies address different parts of the problem. However, complete solutions require integration of several technologies, a non-trivial task.
Our approach, based on these observations and
lessons, suggests a comprehensive data quality
methodology. In addition, we have synthesized the
main ideas from the extant literature to augment our
thinking. For example, Wang (1998) has developed
the concepts, principles, and procedures for defining, measuring, analyzing and improving information products. Based on these and accumulated research efforts, Wang presents a Total Data Quality Management (TDQM) methodology, and illustrates how this methodology can be applied in practice. TDQM consists of four steps: define, measure, analyze, and improve. However, it is not clear how
this methodology can be integrated with the day-to-
day corporate activities of, say, data architectures. On
the other hand, Orr (1998) emphasizes that the
primary focus of data quality projects is to increase
the internal controls involved in entering and editing
data. These efforts, according to Orr, are doomed to
fail, as are one-shot attempts to clean up data, unless the use of that data is increased. Orr contends ``data quality: use it or lose it'', i.e., if an organization wants to improve data quality, it needs to ensure that there is stringent use of each data element, because use-based data quality provides a theoretically sound and practically achievable means to improve data quality. Although we have found, in our practical experience, that highly used data tends to become inconsistent more frequently (people create multiple copies), we do agree that data quality should not be pursued in a vacuum and should be integrated with other activities to increase use.
The basic premise of our approach is that data
quality should be integrated with data architecture
activities and data architecture should be planned with
quality in mind. In particular, all activities in data
architecture (data requirements, data modelling,
partitioning, allocation, data flows and access/interoperability) must keep data quality considerations at
high priority. Our approach, depicted in Fig. 4, is
presented as a sequence of steps that can be used as a
checklist for a comprehensive data quality strategy.
This approach is designed to address the large scale
data quality problem discussed in Section 3. The
following discussion briefly reviews these steps with
particular attention to data quality.
Step 1: Information Requirements and Modeling for Data Quality
Data quality must start with business requirements that drive the technology requirements. Examples of business requirements for the prevention and amelioration of poor data quality include reduction of the cost of doing business and minimization of bad exposure.
The information requirements specify the data, the
natural relationships between data, the transactions to
access and manipulate the data, and software platform
for database implementation. During this step, it is
important to develop an understanding of the
following requirements that drive data quality before
proceeding:
* Data content requirements (i.e., accuracy, consistency, currency).
* Size information (number of users who will access the database, database size).
* Response time (average, worst case) requirements.
* Scaling and growth requirements.
* Data security requirements.
* Data availability restrictions.
* Data synchronization requirements (i.e., how frequently duplicate data should be synchronized: immediately, hourly, daily, etc.).
* Connectivity needed from user sites (e.g., from PCs, Macs, UNIX).
* Interoperability and interfaces with other (existing/proposed) systems.
* Portability between computing platforms (data portability, processing portability, user interface portability).
* Backup/recovery needs and disaster handling.
* Policy restrictions (e.g., standards and policies to be followed, application control and auditing restrictions).
After information requirements, a data model is
developed to represent the view of data at several
levels of detail. From a data quality point of view, data
must be modeled, named and defined with a singular definition and uniform business rules to support all customers. In essence, the following steps in the development of a data model should keep data quality in mind (a toy sketch of the resulting artifacts follows the list):
1. Define entities (the principal data objects about
which information is to be collected).
2. Identify relationships (the real-world associations)
between entities. Relationships between entities
can be one to one, one to many, and many to many.
3. Identify and develop attributes (characteristics of
entities).
4. Represent this information by using an ERA
diagram.
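As a minimal sketch of how these modeling steps can be captured in a machine-checkable form, the following Python fragment is our own illustration (the Customer entity, its attribute, and the relationship are hypothetical and not taken from the case study); it records each attribute's singular definition together with a uniform business rule:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Attribute:
    name: str
    definition: str                 # the singular, enterprise-wide definition
    rule: Callable[[object], bool]  # uniform business rule (validity check)

@dataclass
class Entity:
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)

@dataclass
class Relationship:                 # real-world association between entities
    name: str
    from_entity: str
    to_entity: str
    cardinality: str                # "1:1", "1:N", or "M:N"

# Hypothetical example: one entity, one attribute, one 1:N relationship.
customer = Entity("Customer", {
    "service_address": Attribute(
        name="service_address",
        definition="Postal address at which service is delivered",
        rule=lambda v: isinstance(v, str) and len(v.strip()) > 0,
    )
})
has_orders = Relationship("has_orders", "Customer", "ServiceOrder", "1:N")
```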
Step 2: Planning for Data Quality
Effective planning and organizing is the key to
improving data quality. This step involves the
following activities:
* Choose the most appropriate metrics and measurement instruments. A few
key metrics need to be selected for tracking because not all metrics can be
or should be tracked. We, for example, assembled a list of 40 metrics that
could be used for measuring data quality. You can customize this list based
on metric priority, method of measurement, frequency of measurement, cost of
measurement, risk of ignoring, and other appropriate factors. The key issue
of the costs and benefits of gathering metrics should be considered.
Well-known sampling techniques can be used so that a small sample suffices
to determine the key problems. Selection of proper instrumentation is an
important aspect of metrics (a small measurement sketch appears at the end
of this step).
* Establish a solid replication strategy. Decide what data will be
replicated, why, and how. This means deciding what data will be moving
between sites and whether it moves in one direction or bi-directionally.
Most organizations have some choices in what pieces of information should
be placed where (i.e., how much information will be replicated in your
home-grown database and how frequently it will be synchronized).
* Determine a synchronization interval for different classes of data. A
synchronization interval is the amount of time you can afford (from a
business point of view) for the data to be out of synch. Synchronization
intervals can be tight (i.e., transaction level) or relaxed (i.e., daily,
weekly). Establishing synchronization intervals can impact data quality.

Fig. 4. Methodology to build quality in data architectures.

* Adopt a cleanup strategy. This includes the initial cleanup and a
``cleanup interval'' for periodic scrubbing of data. For example, establish
a strategy to make the location addresses compliant with postal addresses
and decide how frequently to repeat the cleanup process.
* Architect interfaces to other applications and flowthroughs. How will the
data be exchanged between different applications (open interfaces), and how
will it flow smoothly? In particular, how will the information flow between
different databases and the back-end systems?
* Establish end-user interfaces/access strategies. How will the users access
the data (try to close the back doors)? This issue was brought up in some of
our workshops.
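The measurement sketch referred to above is given below; it is our own Python illustration of the sampling idea, and the record layout, field name, validity rule, and sample size are assumptions rather than the project's actual metrics or data:

```python
import random

def estimate_field_validity(records, field, is_valid, sample_size=200, seed=7):
    """Estimate the fraction of records whose `field` passes a validity rule,
    using a small random sample instead of a full scan."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    if not sample:
        return 0.0
    valid = sum(1 for record in sample if is_valid(record.get(field)))
    return valid / len(sample)

# Hypothetical usage: how many location records carry a non-empty ZIP code?
locations = [{"zip": "07960"}, {"zip": ""}, {"zip": "07039"}, {"zip": None}]
score = estimate_field_validity(locations, "zip", lambda z: bool(z and z.strip()))
print(f"estimated ZIP completeness: {score:.0%}")
```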
Step 3: Data Allocation and Inter-System Flow Issues
Data allocation can cause duplication and thus result
in data quality problems (i.e., different copies
showing different and conflicting information). The
following general guidelines should be used for data
quality:
* Keep data replication as minimal as possible. A large number of replicas
can cause serious problems in two-phase commit (2PC) as well as in
Replication Servers.
* If data must be synchronized as a transaction, then keep the number of
copies small (experience shows no more than 3) and use 2PC.
* If concurrency requirements outweigh ``subsecond'' data integrity
requirements (i.e., data can be synchronized periodically), then use
Replication Servers.
* If the network and nodes are unreliable, then use Replication Servers.
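These guidelines can be read as a small decision procedure. The sketch below is our own Python condensation of them (the parameter names are ours; the three-copy limit comes from the experience cited above):

```python
def choose_replication_strategy(copies, transactional_sync_needed,
                                network_reliable):
    """Rough decision aid reflecting the allocation guidelines above."""
    if copies <= 1:
        return "no replication needed"
    if not network_reliable:
        return "Replication Server (periodic synchronization)"
    if transactional_sync_needed:
        if copies <= 3:
            return "two-phase commit (2PC)"
        return "reduce the number of copies, then use 2PC"
    # Concurrency outweighs subsecond integrity: synchronize periodically.
    return "Replication Server (periodic synchronization)"

print(choose_replication_strategy(copies=2, transactional_sync_needed=True,
                                  network_reliable=True))
```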
In addition to these guidelines, you should system-
atically study and evaluate various data allocation/
duplication strategies and eliminate unacceptable
options as quickly as possible. Discussion of data
allocation strategies is far beyond the scope of this
paper. Interested readers should refer to Umar (1997)
for details.
Intersystem data flow in large-scale distributed
systems is of vital importance in establishing a viable
enterprise data architecture. This issue also impacts
data quality because flowthroughs can cause delays
and data inconsistencies. Data flow, as discussed
in Section 3.3, is a complex problem, involving
many issues that must be considered in order to
design a flow that will be successful and that
will minimize data reconciliation problems. Briefly,
data flow must be considered in the context of the
enterprise and its business goals and processes. In
particular, it is important that such a global context
be one in which unanticipated data flows can be
accommodated, as systems evolve, new systems are
created, and legacy systems need to be interfaced with.
We characterize the issues for data flow design into
several categories. The categories include:

* Global Organization of Systems and Data (i.e., how should systems be
logically organized regarding data flow, how should data be logically
organized regarding data flow, how should timing constraints and race
conditions be handled?).
* Metadata (i.e., what standards should there be for data, what does a
system need to know about other systems, where is this information stored
and maintained, what is required of metadata in a global schema?).
* Data Validation (i.e., where should data be entered into a set of systems,
where should data be validated in a set of systems, what should happen if
data fails a validation test?).
* Data Flow Enablement (i.e., what must happen between systems to realize
the data flow, what is the role of immediate vs. batch data propagation, how
should dynamic data be treated in relation to reference data?).
* Access Paradigms (i.e., what should the intersystem access paradigms be,
how can we know that the data flow is successful, how do we deal with system
or network failures?).
* Data Replication (i.e., under what circumstances should we replicate data,
how should we manage replicated data?).
* Procedures and Design Process (i.e., what should be automated, what
procedures are required for job scheduling, how can we gracefully evolve a
set of systems, what is required of data flow documentation, what is the
role of people and organizations in data stewardship?).
For each category, a set of guidelines should be
developed. We discussed a few of these requirements
in Section 3.3.
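As one concrete reading of the Data Validation category, the sketch below (our own Python illustration; the field names and rules are hypothetical) validates a record at its point of entry and reports the failures so that they can be returned to the originating system instead of flowing downstream:

```python
def validate_at_entry(record, rules):
    """Apply entry-point validation rules; `rules` maps a field name to a
    predicate. Returns (ok, list of failure descriptions)."""
    failures = [f"{field}: failed {check.__name__}"
                for field, check in rules.items()
                if not check(record.get(field))]
    return (not failures, failures)

def non_empty(value):
    return bool(value and str(value).strip())

# Hypothetical usage: reject the record at entry rather than propagating it.
ok, problems = validate_at_entry({"clli_code": "", "state": "NJ"},
                                 {"clli_code": non_empty, "state": non_empty})
if not ok:
    print("Return to originating system:", problems)
```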
Step 4: Technologies Assessment and Selection
Selection of commercial off-the-shelf (COTS) technologies
should be viewed in terms of data quality. In
particular, software infrastructure should be chosen
for:
* Quick and efficient access of data from a wide array of users residing at
different sites.
* Interoperability of data between heterogeneous systems.
* Data cleanup.

Access of data from a wide array of users residing at
different sites is currently being provided through
Web technologies. Web browsers access these
databases through a Web server that may invoke
SQL middleware. ODBC/JDBC drivers are commercially
available to support ubiquitous access to
heterogeneous data sources from Web browsers.
Web-based tools with ODBC/JDBC should be
considered for data access quality.
Data interoperability is greatly influenced by the
exchange protocol (i.e., the format of the data and the
rules of exchange) that is used between clients and
servers. Let us ask the following question: can a
database stored in a vendor X database exchange
information with vendor Y tools? In general, the
answer to this question is no because the exchange
protocols are largely proprietary at the time of this
writing (i.e., clients from vendor X can only access
vendor X database). This mismatch has led to
``database gateways'' that convert vendor X protocols
into vendor Y protocols. DRDA (Distributed Relational
Database Architecture) Gateways are an example of
such gateways (DRDA Gateways from Informix,
Oracle and other vendors convert the Informix and
Oracle protocols to DB2 protocols). ODBC/JDBC
drivers can also be used for data interoperability.
Selection of an appropriate exchange protocol is also part
of the data quality work.
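The flavor of such driver-based access is sketched below in Python using the pyodbc package; the DSN names and the table are hypothetical, and the fragment is our illustration rather than code from the project. Because the ODBC driver hides each vendor's exchange protocol, the same function can read from databases supplied by different vendors:

```python
import pyodbc  # ODBC binding; assumes the DSNs and drivers are configured

def fetch_rows(dsn, query):
    """Run the same query against any source reachable through ODBC,
    regardless of the vendor behind the data source name (DSN)."""
    connection = pyodbc.connect(f"DSN={dsn}")
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        connection.close()

# Hypothetical DSNs for two different vendors' databases.
oracle_rows = fetch_rows("ORACLE_LOCATIONS", "SELECT clli, address FROM location")
db2_rows = fetch_rows("DB2_LOCATIONS", "SELECT clli, address FROM location")
```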
A diverse array of technologies, available from a
variety of vendors, can be used to address many of the
data quality problems. These technologies come with
varying price/performance ratios, offer a wide range
of features, and are available on different platforms.
Examples of the technologies of particular interest to
us are data scrubbers, data warehouse tools, replica-
tion servers, workflow managers, and multidatabase
systems. We discussed these technologies briefly in
Section 3.4. As noted previously, there is no single
solution to the entire problem; however, a well-
balanced combination of the aforementioned technol-
ogies can alleviate the problem substantially.
Step 5: Implementation, Deployment, and Support
An effective measurement and control process is the
key to improving data quality during the implementa-
tion, deployment and support stage. This involves:
* Keep focus on the root causes. The root causes may include enterprise
architecture issues such as the design of data flow/synchronization, the
architecture of legacy products, the number of manual interventions needed,
and the availability of well-trained staff.
* Select and deploy the most appropriate data quality improvement tools.
Data quality improvement tools fall into the following categories:
- Data scrubbers that correct incomplete and incorrect data through rules;
- Compare and stare tools that help in making the data consistent with reality;
- Data replicators to keep the data consistent.
We have discussed these tools in an earlier deliverable.
* Consider network issues. What type of loads and availability requirements
will be placed on the network (several replication schemes require highly
available networks)?
* Re-evaluate platform issues. On what platforms should replicated data
reside, and what type of middleware is available on these platforms (ties
into COTS)? We have reviewed platform issues in the previous step.
* Establish Quality Control. After a metric process has been established and
implemented, a quality control process must be instituted to monitor,
evaluate, and control the process based on the metrics. The focus is on
detection and elimination of special causes that appear in the process. In
particular, the process performance is predicted/measured against:
- Process goals (e.g., RMAs must not exceed 100 per week);
- General trends (e.g., gradual improvement of data quality);
- Variation control (e.g., learn the root causes of pronounced variations).
Statistical quality control (SQC) is used to detect and eliminate special
causes. For example, SQC can be used to monitor key metrics such as RMAs. If
the RMAs for a particular system are fluctuating significantly, then it is
important to understand the reason and determine the root causes of this
variation (a process owner should take responsibility for SQC; a small
control-chart sketch follows this list).
* Improve Data Quality. The basic principle is to focus on critical success
factors (key metrics) and systematically improve them by identifying root
causes and eliminating them. From a business perspective, examples of these
key metrics are customer satisfaction, RMAs, retries, average time to
complete an order, etc. In addition to long-range systematic process
improvement that is reflected in trends, unusual system behavior as
displayed by SQC charts should be considered.
* Establish a change management procedure. This includes changes in the
data, the metadata, as well as the underlying platforms.
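The control-chart sketch mentioned in the quality-control item above is given below; it is our own Python illustration (the weekly RMA counts and the three-sigma individuals-chart simplification are assumptions, not project data or the project's actual SQC procedure):

```python
from statistics import mean, stdev

def control_limits(baseline_counts, sigmas=3):
    """Center line and control limits computed from an in-control baseline."""
    center = mean(baseline_counts)
    spread = stdev(baseline_counts)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(new_counts, lower, upper):
    """Weeks whose counts fall outside the control limits; candidates for
    root-cause analysis by the process owner."""
    return [(week, count) for week, count in enumerate(new_counts, start=1)
            if count < lower or count > upper]

# Hypothetical weekly RMA counts: a stable baseline, then a pronounced
# variation in week 3 of the new observations.
lower, upper = control_limits([80, 85, 78, 82, 90, 84, 79, 83, 81, 86])
print(out_of_control([82, 88, 140], lower, upper))   # -> [(3, 140)]
```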
Step 6: Management and Staffing Issues
Appropriate management initiatives coupled with
skilled and motivated staff are extremely important
in improving data quality. The following manage-
ment/staffing issues should be considered:
* Assign roles and responsibilities. Distinguish data owner versus quality
``inspector'' roles. Basically, decide who is responsible for what during
the data life cycle (data creation, design, implementation, maintenance,
retirement). Some companies have established ``enterprise data architecture
groups'' that watch over the health of corporate data throughout its life
cycle. Another idea is to treat data management as asset management: the
managers are paid based on the increase in the value of the asset (similar
to other assets in the organization).
* Establish a data quality owner. It is important to identify someone who is
responsible for measuring and improving the quality of data. For example, as
the value of customer and location information increases due to the New
Telecom Bill, someone should own the data quality improvement task to
minimize business impact. The owner may be a person or, in the case of large
projects, a team that is responsible for data quality (this may be part of a
total quality management program). The process owner should be viewed as the
manager of the data asset and should be rewarded or reprimanded as this
asset appreciates or depreciates.
* Develop organizational procedures. In particular, develop a feedback loop
to continuously improve data quality (the notion of asset management). These
organizational procedures may include a variety of items (based on business
drivers) such as determining and isolating backdoors, performing high-level
analysis of root causes, identifying missing information, and routing
feedback by roles (i.e., the next person to contact the customer and the
stewards for data of type x, etc., are first in the feedback loop).
Organizational procedures should include a reward/reprimand system.
* Establish scheduling scenarios. Decide on the events and conditions that
will trigger data movement, the time needed for data movement, and
considerations about time zone differences. This may require some
organizational considerations.
* Include a reward/reprimand system to motivate people. Specifically,
automated flowthroughs between systems (e.g., Telcordia's LOCATION
Flowthrough product) can result in significant changes and, in some cases,
the elimination of several tasks performed by the CLLI coordinators and
other related staff. There has to be a reward system for these people to be
motivated about implementing workflows (perhaps responsibility shifts to
monitoring the process through metrics).
* Do not overlook administrative details. Take care of additional factors
such as the following:
- Staff training;
- Position descriptions (roles and responsibilities);
- Responsibility shifts (impact on organizations);
- Communication/administrative issues (i.e., documentation).
5. Concluding Comments
We have described a case study of data quality in the
telecommunications industry and shown how require-
ments for data quality were collected, how the data
quality metrics were defined, what guidelines were
established for intersystem data flows, what COTS
technologies were used, and what results were obtained
from a prototype. As a result of the experience gained and
lessons learned, we have developed a comprehensive
approach that synthesizes data quality issues into an
architectural vision that is supported by a series of
steps, procedures, checklists, and tools. Our approach
takes into account the technology, process, and people
issues and can be used to design data quality as
an integral part of data architectures. In addition,
we have synthesized the main ideas from the
extant literature to augment our thinking. The
results reported in this paper are a summary
of a series of technical reports, listed in the sidebar
``Data Quality Project Deliverable Reports''.
In the future, we plan to publish further results
from these reports as our experience grows in this
area.
Acknowledgment
We would like to thank Josephine Micallef for her
substantial help in this project, which would not have
been complete without her tireless efforts in imple-
menting our prototype.
References
Ballou D, Pazer H. Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science 1985;31(2):150-162.
Cellco J. Don't Warehouse Dirty Data. Datamation 1995;42-52.
Data Quality: A Critical Information Systems Consideration. Data Management Review 1996.
English L. Data Quality: Meeting Customer Needs. Data Management Review 1996;44-51.
Kaplan D, Krishnan R, Padman R, Peters J. Assessing Data Quality in Accounting Information Systems. CACM 1998;41(2):72-78.
Orr K. Data Quality and Systems Theory. CACM 1998;41(2):66-71.
Redman T. Data Quality. Bantam Books, 1992.
Redman T. Data Quality for the Information Age. Artech House, 1992.
Redman T. The Impact of Poor Data Quality on the Typical Enterprise. CACM 1998;41(2):79-83.
Moriarity T. Better Business Practices. Database Programming and Design 1996;59-61.
Moriarity T. Barriers to Data Quality, Part II: Business Practices. Database Programming and Design 1996;61-63.
Tayi G, Ballou D. Examining Data Quality. CACM 1998;41(2):54-57.
Umar A. Application (Re)Engineering: Building New Web-based Applications and Dealing with Legacies. Prentice Hall, 1997.
Wang RY. A Product Perspective on Total Data Quality Management. CACM 1998;41(2):58-96.
Wand Y, Wang R. Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM 1996;86-95.
Amjad Umar is Director of the Distributed Objects
Engineering and Research Group at Telcordia
Technologies. He is also an Adjunct Professor at
Rutgers University and Stevens Institute of
Technology where he develops and teaches graduate
level courses in distributed systems, object-oriented
systems, networks, and databases. At Telcordia, he
leads projects in distributed object technologies,
advanced middleware, mobile computing, electronic
commerce, legacy data access, data integration, and
data warehouses. His previous experience includes
consulting assignments in the manufacturing industry
and a faculty position at the University of Michigan. He
has authored more than 20 technical papers and three
Prentice Hall books: Distributed Computing and Client-Server Systems, Application (Re)Engineering: Building Web-based Applications and Dealing with Legacies, and Object Oriented Client/Server Internet Environments. He has an M.S. in Computer,
Information and Control Engineering and a Ph.D. in
Information Systems (Industrial and Operations
Engineering Department) from the University of
Michigan. He is a member of ACM and IEEE.
George Karabatis has been a Research Scientist at Telcordia
Technologies since 1995, where he conducts research
on heterogeneous information resources, workflow
systems and distributed databases. He received his
Ph.D. and M.S. in Computer Science from the
University of Houston, and his B.S. in Applied
Mathematics from the Aristotle University in
Greece. His research interests include integration of
heterogeneous systems, data warehouses, system
architectures, workflow systems, quality of data,
consistency maintenance between interdependent
data, and transaction management. He has authored
numerous publications in refereed journals, book
chapters, and conference proceedings in data-
base management, workflows, and distributed
systems. He is a member of IEEE, IEEE/CS and
ACM.
Linda Ness is an Executive Director at Telcordia
Technologies. In this capacity, she oversees tactical
and strategic research projects at Telcordia. Her areas
of interest span data quality, workflow systems,
Internet Telephony, data warehouses, and data
integration. In her previous assignments at Telcordia
Technologies, she has managed research in data
quality and data reconciliation. She also has held
various faculty positions. She holds a Ph.D. in
Mathematics from Harvard University.