De-duplication Technology and Practices for Integrated Child-Health Information Systems Susan M. Salkowitz, MA, MGA Salkowitz Associates, LLC Stephen Clyde, PhD Utah State University, Computer Science Department Preparation of this publication was supported by a contract from All Kids Count, a program of The Robert Wood Johnson Foundation.




October 2003

This publication was supported by a contract from All Kids Count, a program of The Robert Wood Johnson Foundation, to Salkowitz Associates, LLC and the Computer Science Department of Utah State University. The views, content and citations reflect those of Salkowitz Associates, LLC and the Computer Science Department of Utah State University.

Ordering Information
This publication is available online at the Public Health Informatics Institute web site, www.phii.org.

Copyright © 2003 by All Kids Count, Public Health Informatics Institute. All rights reserved.


Acknowledgements

This is to acknowledge the participation and support of the following Connections projects, their staffs and consultants:

Centers for Disease Control
Janet Kelly, National Immunization Program

Connections Project
Ellen Wild, Director of Programs
Patricia Richmond, Program Associate

Kansas Integrated Public Health System (KIPHS)
Pete Kitch, MBA, Director, KIPHS Project Office
Larry Garrett, Staff Epidemiologist, KIPHS Project Office

Maine Bureau of Health
Lisa Tuttle, MPH, Director, Maine Immunization Program
John Pease, Immunization Systems Manager
Michael Wenzel, Health Program Manager

Missouri Department of Health and Senior Services
Garland Land, Center Director, Center for Health Information, Management and Evaluation
Nancy L. Hoffman, RN, Deputy Center Director, Center for Health Information, Management and Evaluation
Mare Dicneite
Bill Gathright
George Lauer

New York City Department of Health and Mental Hygiene
Amy Metroka, Director, Citywide Immunization Registry
Paul Schaeffer, MPA, Research Scientist
Alex Ternier, Citywide Immunization Registry
Vikki Pappadouka, Citywide Immunization Registry

Oregon Department of Human Services
Sherry Spence, MCH Data Systems Coordinator, Office of Family Health, Health Services
Marion Sturtevant
Buck Woodward
Barbara Canavan, Director, Oregon Immunization ALERT, Health Services
Don Dumond


Rhode Island Department of Health
Amy Zimmerman, MPH, Chief, Children's Preventive Services
Mike Simoli, Data Manager
Mike Berry
HLN Consulting, LLC

Utah Department of Health
Rhoda Nicholas, MBA, PMP, Director of Product Strategy and eGovernment
John Eichwald, MA, CHARM Program Manager
Barry Nangle, PhD, Director, Center for Health Data
Chris Pratt
Nancy Pare

Other acknowledgements
Douglas R. Murray, Director, Arkansas Center for Health Statistics
Tsai Mei Lin, SAS Analyst, Arkansas Center for Health Statistics


Table of Contents

Acknowledgements
Table of Contents
Executive Summary
1. Introduction
2 Overview of De-duplication Technology
  2.1 Data-item Transformation
    2.1.1 Dates and times
    2.1.2 Addresses
    2.1.3 Measurements and Demographics
    2.1.4 Names
  2.2 Match Technologies
    2.2.1 "When" issues
    2.2.2 "How" issues
    2.2.3 "What" issues
  2.3 Record Coalescing (Linking or Merging)
  2.4 Integration Classifications
    2.4.1 Stand-alone systems
    2.4.2 Software Development Kits
    2.4.3 Server-based systems
3 Software Products
  3.1 Products and Their Classification
  3.2 Off-line Evaluation
    3.2.1 Cost
    3.2.2 Supported Platforms
    3.2.3 Existing Applications
    3.2.4 Matching Technology
    3.2.5 Merging Technology
    3.2.6 Product Support
  3.3 Benchmark Evaluation
    3.3.1 Step 1 - Benchmark Evaluation Criteria and Testing Techniques
    3.3.2 Step 2 - Setup and Learn the Product
    3.3.3 Step 3 - Measure the Product Against the Evaluation Criteria
    3.3.4 Step 4 - Compile, Interpret, and Document the Results
  3.4 Discussion
    3.4.1 Finding common basis for comparison
    3.4.2 Obtaining evaluation software
    3.4.3 Obtaining or creating meaningful test data
    3.4.4 Interpretation of results
4. Review of De-duplication in Integrated Child-Health Information Systems in Eight Connections Projects
  4.1 Rhode Island
  4.2 Oregon
  4.3 Oregon Immunization ALERT
  4.4 New York City
  4.5 Missouri
  4.6 Kansas
  4.7 Maine
  4.8 Utah
5. Observations from Study
  5.1 Technical Observations
    5.1.1 Overall de-duplication processes and algorithms
    5.1.2 Level of automation
    5.1.3 Record Matching
    5.1.4 Source of information and effective data element for matching
    5.1.5 Record Merging
    5.1.6 Deployment Timetables
  5.2 Non-technical Issues
    5.2.1 Scope and Organization of the Integration Effort
    5.2.2 Intended Use of the Integrated Data
    5.2.3 Role of the Immunization Registry Beginnings
    5.2.4 Role of Vital Records
    5.2.5 Role of Communities of Practice
    5.2.6 Program Mandates and Organizational Structure
    5.2.7 Academic Research
    5.2.8 Strategic Planning
  5.3 Future Study
    5.3.1 Testing and Assessment
    5.3.2 Useful Data Elements and Types of Comparisons
    5.3.3 Impact of Privacy Issues
    5.3.4 Birth-Death Matching
References
APPENDIX A - Additional Reference Material
  Survey Questionnaire
  Information from Rhode Island
  Information from Oregon
  Information from Maine
  Information from Missouri
  Information from Arkansas Project


Executive Summary

Child-health integration projects create enterprise-wide, person-centric systems from disparate files with different business rules for identification. Data-cleaning activities, termed de-duplication, are performed to match and merge records appropriately. Projects are challenged to select the most effective de-duplication tools and strategies for their environments.

Interested Connections projects requested this study to research de-duplication software and approaches, perform limited testing and technical analysis, and document the findings in matrices showing effectiveness, underlying approach, cost and other factors. This report provides a description, analysis and evaluation of de-duplication software based on vendor information and limited testing, documents the de-duplication practices of the participating projects, and discusses different approaches and their efficacy.

The study yielded no single best product, but it provides a framework for examining alternatives and weighing the trade-offs involved in choosing products and strategies that match project requirements. It demonstrates the value of the community of practice and identifies areas for further work.


1. Introduction

Duplicate records in any database can cause serious data-quality problems and prevent an information system from reaching its full potential. This is particularly true for people-centric health systems, where the real value of the data comes from a user's ability to view as much information about a person as possible, within the confines of confidentiality guidelines. If the information for a person is spread across multiple, unrelated records, a user might miss important data about that person. The more complete and accurate the information, the better the services health-care professionals can provide. Problems with fragmented and duplicate data can be particularly acute for integrated child-health information systems, because

• The data for a child comes from multiple sources;
• There is no universal key that allows the integrated system to correlate records from these different sources;
• Alternate identifiers, such as names, are often incomplete or subject to change;
• The original data may contain errors (e.g., keyboarding errors, missing information); and
• Similar fields in the various record structures may have inconsistent meanings.

De-duplication is the process of removing redundant data from the database, preventing fragmented and duplicate information from getting into the system, and assuring that queries and updates apply to the correct record [7]. These are difficult issues because redundant information may be hard to spot, correct data may exist in many different records, and data may be represented in alternate but equivalent ways.

At the Connections1 meeting in Rhode Island in September 2002, several members expressed interest in a study to evaluate de-duplication algorithms on the basis of effectiveness and cost and to determine which combinations of available data elements produce the best match rates. They suggested that a project to perform research, technical analysis and limited testing, and to document the findings in a matrix showing trade-offs, effectiveness and cost would be useful to all projects and advance integration efforts.

This report presents the results of the ensuing research project. It first examines technology and off-the-shelf products that support de-duplication in some way. To make the de-duplication process more tractable, researchers and software developers divide it into three sub-problems:

• data-item transformation • matching • merging

1 The Connections group is a community of practice sponsored by All Kids Count, a program of The Robert Wood Johnson Foundation.


Section 2 describes these sub-problems in more detail and gives some background on the solutions currently available for each. Section 3 provides a framework for reviewing products that support de-duplication activities and presents a sample evaluation.

This report also describes the de-duplication processes currently found in eight integrated child-health information systems built by members of the Connections group. Each of these integration projects involves the creation of an enterprise-wide, person-centric system that contains records of individual children and supports programmatic services, operations, reporting, and tracking. These systems import, link or access files from disparate sources that have different standards and business rules for identifying children, lack universal keys, and contain data inconsistencies and errors. Section 4 compares and summarizes these projects in terms of their scope and approach to de-duplication.

The credibility and usefulness of an integrated information system depends heavily on its ability to perform quality-assurance tasks, starting with de-duplication. However, de-duplication is a complex, resource-intensive and costly process, and integration projects need to consider a number of technical and non-technical issues. Section 5 summarizes these issues and presents some ideas for handling them based on the technology, products, and projects reviewed in Sections 2-4. Section 5 also presents several ideas for future research that would further benefit the integration projects.


2 Overview of De-duplication Technology

Removing duplicate information in people-centric integrated child-health information systems involves three main sub-problems: data-item transformation, record matching, and record coalescing. Sections 2.1-2.3 discuss basic concepts for these three sub-problems. Solutions to these problems vary not only in underlying technology but also in how they can hook into information systems, particularly integrated health-care systems. Section 2.4 summarizes three general approaches, which we refer to in this paper as Integration Classifications and will use later to help categorize de-duplication products.

Independent of the technology or its integration, the ultimate goal is to remove duplicate information. Evaluating success against this goal, however, is not a trivial matter. Section 3 discusses ways to test and measure the accuracy and efficiency of a complete de-duplication product.

2.1 Data-item Transformation

The effectiveness of record matching depends on the quality of the data in the individual records. Data-item transformation involves standardizing and simplifying the values in individual records so that subsequent record matching can be more efficient and accurate. Such transformations typically map existing values for one type of data (field) to a new (hopefully cleaner) set of values. For example, birth dates are transformed to cleaner birth dates. In general, however, any set of fields can be mapped to any other set of fields. Galhardas et al. defined an SQL-like language for specifying such data transformations as part of their data-cleaning framework [7]. High-end databases, such as Microsoft SQL Server and Oracle, support similar capabilities.

Ideally, every field involved in the matching process should go through a cleaning process. For some data fields, however, this isn't practical, nor will it lead to significant improvements in matching. Some good candidates for data-item standardization are dates and times, addresses, measurements and demographics, and names.

2.1.1 Dates and times

Dates and times, such as birth dates, birth times, vaccination dates, and screening timestamps, can play key roles in determining potential matches. To make efficient comparisons, the dates and times should be in a common, well-defined format. In principle, this is simply a matter of implementing the appropriate date and time transformations. In practice, however, such conversions have to deal with several sticky problems, including garbage data, missing or partial data, inconsistent semantics, and "magic numbers." See the sidebar discussion below for descriptions of these problems.
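To make these ideas concrete, a date-cleaning routine along the lines described above might map raw strings to a canonical form while routing garbage and magic values to a null plus an explicit flag. This is a hedged sketch, not code from any product reviewed here; the sentinel values, the MM/DD/YY input format, and the two-digit-year cutoff are illustrative assumptions.

```python
from datetime import date

# Syntactically valid values that carry special meaning (assumed sentinels).
MAGIC_DATES = {"99/99/99": "stillborn", "01/01/01": "unknown"}

def normalize_date(raw: str):
    """Return (ISO date or None, flag) for a raw MM/DD/YY or MM/DD/YYYY string."""
    raw = (raw or "").strip()
    if raw in MAGIC_DATES:
        return None, MAGIC_DATES[raw]   # magic number -> null date + explicit flag
    parts = raw.split("/")
    if len(parts) != 3:
        return None, "garbage"          # unparseable shape -> garbage
    try:
        m, d, y = (int(p) for p in parts)
        # Assumed pivot for two-digit years: 30-99 -> 19xx, 00-29 -> 20xx.
        y += 1900 if 30 <= y <= 99 else 2000 if y < 30 else 0
        return date(y, m, d).isoformat(), "ok"
    except ValueError:
        return None, "garbage"          # e.g., month 13 or day 45
```

A call such as normalize_date("99/99/99") then yields a null date with a stillborn flag, rather than a bogus value that could trigger spurious matches.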


Garbage data
Garbage data are any values that do not represent meaningful information with respect to the fields they are in. Garbage data can occur for any number of reasons, the most common being inadequate input validation during data entry. Other reasons include erroneous conversion of legacy data and shifts in a field's meaning over time. Regardless of the reason, a solution to garbage data should involve both corrective and preventive actions. Because of the nearly random nature of garbage, corrective actions often require manual inspection and editing of individual records. If there are recurring patterns in the garbage, a database programmer may be able to build a script that automatically cleans up that particular kind of garbage. Preventive actions attempt to fix the design or implementation failures that allow garbage data to get into the system in the first place.

Missing or partial data
Missing or partial data in date/time fields may not be too serious, as long as the matching process compares missing or partial values in a consistent way. Consider a date field that has year, month, and day subfields and allows the day subfield to be unknown (null). In this case, care needs to be taken during comparison: May 2003 (with no day specified) should match any date between May 1, 2003 and May 31, 2003.

Inconsistent semantics
Inconsistent semantics exist when the values of a field can have different meanings. Consider an integrated child-health information system that captures summary information (date and event type) for various events, such as newborn screenings, vaccinations, and hearing screenings. Because the information comes from different sources, the values for event date may actually have different semantics: in one system, the date might represent the day on which the event started; in others, the day on which the event ended or the day on which it was entered into the system. Such inconsistencies arise when the individual systems that comprise an integrated system evolve independently (which is typically the case) or as field semantics change over time. The solution to inconsistent semantics is basically the same as for garbage data, except that the corrective action is often more amenable to automation.

Magic numbers
Magic numbers are valid values that have been given special meaning. For example, a birth-date field might contain "99/99/99" to mean that the baby was stillborn. Magic numbers can be particularly problematic for integrated health-care systems because their meaning is not likely to be shared across all the participating health-care programs. Matching records using such values can have unexpected and undesirable results. The solution is to convert magic numbers to common, standardized values that represent their intended meaning. This may necessitate defining new data fields that hold extra information about special conditions. For example, the system could translate "99/99/99" to a null birth date and set a flag in a separate field that explicitly means stillborn.

2.1.2 Addresses

The reformatting and verification of address information offers perhaps one of the biggest potential payoffs in terms of improving the effectiveness of record matching. Addresses, if accurate and standardized, can be excellent discriminators for otherwise similar records. Also, since relatively few people typically share the same address (except in communities with large apartment buildings or multi-family dwellings), addresses can help narrow down the search space and thereby improve the speed of the matching process.

Putting addresses into a standard format and verifying them against known addresses is a difficult problem. However, there are hundreds of off-the-shelf products and services that do just that. The United States Postal Service (USPS) has developed a certification system for these products and services, called the Coding Accuracy Support System (CASS) [15]. CASS enables the USPS to evaluate a product in three areas: ZIP+4 delivery-point coding, carrier-route coding, and five-digit coding [15]. If a product achieves an accuracy of 98% or better on a test database of 100,000 addresses, the USPS will certify that product for six months [3]. The USPS updates its list of certified vendors on a regular basis; see http://www.usps.com. As of Sept. 5, 2003, there were 500 companies selling CASS-certified products or services [16]. These products range from stand-alone, self-contained software systems, to customizable packages with network interfaces, to pay-as-you-go Internet services. Some of the products work in a batch mode intended for behind-the-scenes cleanup of existing data. Others provide an interface for interactive address correction. Still others support both modes of operation.

The products range in price from several hundred to tens of thousands of dollars. The low-end products are mostly stand-alone systems that provide limited features and don't support any kind of integration with existing information systems. Most of the high-end products work in multiple modes, support a variety of interactive or batch interfaces, and are customizable. A more detailed evaluation of address-cleaning software is beyond the scope of this report. However, integration projects that are considering buying or building address-cleaning software should look at packages that support both interactive and batch processing; these provide the most flexibility in how they integrate into the health system. Stand-alone address-cleaning systems are not likely to represent cost-effective solutions.
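Even without a CASS-certified product, the basic idea of address standardization can be sketched in a few lines: fold case, strip punctuation, and map common street-suffix and directional words to standard abbreviations so that superficially different addresses compare equal. This is only an illustrative sketch, not a substitute for verified address cleaning; the abbreviation table is a tiny assumed subset of the USPS conventions.

```python
import re

# A few USPS-style abbreviations; this subset is illustrative, not exhaustive.
SUFFIXES = {"street": "ST", "avenue": "AVE", "boulevard": "BLVD",
            "road": "RD", "drive": "DR", "north": "N", "south": "S",
            "east": "E", "west": "W", "apartment": "APT"}

def normalize_address(addr: str) -> str:
    """Upper-case, strip punctuation, and apply standard abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", addr.lower()).split()
    return " ".join(SUFFIXES.get(t, t.upper()) for t in tokens)
```

With this routine, "123 North Main Street" and "123 north MAIN St." both normalize to "123 N MAIN ST" and so can act as the same discriminator during matching.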

2.1.3 Measurements and Demographics

If the matching process takes measurements or demographics into account, it is important to standardize the data in these fields as well. For example, consider an integrated system that involves vital statistics and newborn-screening data. One system may record a child's birth weight in ounces and the other in grams; which unit is used often depends on whether the birth weight comes from a birth record, a medical record, or an anecdotal report. To compare these numbers for matching purposes, they need to be in a common unit of measure.

Similarly, if the matching process uses race or ethnicity, the system needs to make sure that the possible values for these fields are well defined and consistent. For example, a program's ethnic definitions may vary from census definitions, or patients may self-identify their race differently from how it is officially classified. In both cases, if race and ethnicity are to play a role in the matching process, data-transformation software must map them to common classification schemes.

On the surface, it would seem that simple data-transformation operations would be able to standardize measurements and demographics. However, like date and time values, measurements and demographics can suffer from garbage data, missing or partial results, inconsistent semantics, and magic numbers. The same solutions discussed above for dates and times apply here. Even relatively static fields like race and ethnicity are subject to these problems, because their categorizations can change over time.
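The unit-conversion part of this standardization is straightforward to sketch. The following hedged example (the function name and unit codes are assumptions, not from any project described here) converts birth weights to grams so that records from different sources become directly comparable.

```python
GRAMS_PER_OUNCE = 28.349523125  # avoirdupois ounce, exact by definition

def birth_weight_grams(value: float, unit: str) -> float:
    """Convert a birth weight to grams so values from different sources compare."""
    if unit == "g":
        return value
    if unit == "oz":
        return value * GRAMS_PER_OUNCE
    raise ValueError(f"unknown unit: {unit}")  # surface garbage units, don't guess
```

Raising on an unrecognized unit, rather than silently passing the value through, follows the corrective-action advice above: garbage should be flagged for review, not propagated into the matching process.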

2.1.4 Names

Since names are widely used in record matching, they are obvious candidates for standardization and simplification. However, as with addresses, this is a difficult and potentially costly endeavor because name data can contain a wide variety of inconsistencies. Smith explains that these problems have many causes, including data-entry errors, the use of aliases and nicknames, differences in spelling, cultural factors, and historical factors [13]. Green and Lutz also explain:

One type of data that has been persistently problematic for automatic processing is that of named entities, especially personal names. Unlike other data elements, such as Social Security numbers or other kinds of IDs, named entities can show significant, sanctioned variation. Furthermore, names tend to be much more variable in spelling than other lexical items. Predicting the way a particular name of a particular individual will be spelled is often problematic [8].

Researchers and developers have tried many different approaches to standardizing and simplifying names for matching purposes. One approach attempts to break the name down into individual pieces, organize the pieces by name type, and transform them into values that can be more readily compared. For example, a name like "Maria Jessica dela Lopez Garcia" would result in the following pieces: a baptismal name of "Maria," a given name of "Jessica," a particle of "dela," a patronymic name of "Lopez," and a matronymic name of "Garcia." Using Soundex [11] or some other encoding scheme, each of these pieces could then be simplified for comparison. The problem with this approach is determining the pieces and their name types. This can be difficult even in databases that store names in two or three fields, such as last name, first name, and middle initial. Many names, like "Maria Jessica dela Lopez Garcia," don't fit the pattern, leaving data-entry staff to their own devices for interpreting and entering the name into the system. Consequently, two different data-entry people could easily add the same child to a system under two very different names.

Other approaches simply remove extraneous characters from names and then let the matching process perform string comparisons and edit-distance comparisons. String comparisons are fast, but don't allow for data-entry errors or spelling variations. Edit-distance comparisons can be slow, but are relatively forgiving of common data-entry errors and some types of spelling variations.

Off-the-shelf name transformation and classification products are available; one example is NameClassifier™ by Language Analysis Systems. A detailed evaluation of name or other data-item transformation products is beyond the scope of this study.
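The Soundex encoding mentioned above is simple enough to sketch. The version below follows the common American Soundex rules (first letter kept, remaining letters mapped to digit classes, adjacent duplicates collapsed, h/w ignored between same-coded letters, padded to four characters); it is a minimal illustration, not the exact variant used in any product discussed here.

```python
def soundex(name: str) -> str:
    """Encode a name so that similar-sounding spellings map to the same code."""
    codes = {**{c: "1" for c in "bfpv"}, **{c: "2" for c in "cgjkqsxz"},
             **{c: "3" for c in "dt"}, "l": "4", "m": "5", "n": "5", "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")          # the first letter's code is remembered
    for c in name[1:]:
        d = codes.get(c, "")
        if d and d != prev:                # skip letters coded like the previous one
            digits.append(d)
        if c not in "hw":                  # h and w do not separate same-coded letters
            prev = d
    return (first + "".join(digits) + "000")[:4]
```

Thus "Smith" and "Smyth" both encode to S530, so a matcher comparing Soundex codes treats them as candidates for the same person despite the spelling variation.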

2.2 Match Technologies

Given a subject record (either an existing record in a database or information about a person that might be added to a database), matching is the process of finding existing records that might be for the same person. Some researchers further divide matching into two sub-problems: finding candidate records and clustering them into groups of matching or potentially matching records [7]. There are three questions about any given matching product that are of particular interest for integrated health systems:


1. When does the matching occur? 2. How does the matching algorithm work? 3. What data does or can the matching algorithm use?

2.2.1 "When" issues

There are two basic answers to the first question: interactive front-end and automated back-end. Systems that support front-end matching allow users to look for potential matches prior to adding a new record into the system, or during the process of adding a new record. If a match is found, the existing record is used instead of adding the new one. The aim of front-end matching is to minimize the number of duplicate records that actually get into the database. It can also take advantage of a user's first-hand knowledge about the person. For example, consider the following scenario:

1. A mother brings a child named "Sue Smith" into a clinic and a clerk begins a data-entry process for the child.

2. Using front-end matching software, the clerk first searches for existing "Sue Smith" records.

3. The matching software returns three candidates, so the clerk asks the mother for more details, such as current address and the child's birth date.

4. Based on this first-hand information, the clerk then determines that one of the three existing records is actually for this "Sue Smith".

5. Instead of creating a new record (and a potential duplicate), the clerk simply uses this existing record.

Obviously, if the matching software had returned zero candidates, or if the clerk had determined that none of them were for this child, the clerk would add a new record.

Back-end matching occurs among records that are already in a database. Systems that support this mode of operation periodically step through the records in a database and check whether each one matches any others. Typically, a back-end approach involves organizing records into groups or clusters, each representing a set of possible matches. Some back-end matching systems compute a confidence rating for each cluster that indicates how likely it is that the match is real. Furthermore, some systems will even try to resolve the duplicates in a group automatically, if the confidence rating is high enough. (See Section 2.3 for more information on the different ways that duplicates can be resolved.) Such systems may also set aside for manual review any clusters that represent potential matches but for which the confidence is not high enough to process automatically. A user will review these clusters more closely at a later time and then determine whether they actually match. Here is a typical scenario for a back-end solution.

1. The matching software first selects a subject record from the database, for example, "Joe Jones".

2. It then finds four potential matches for "Joe Jones", and creates a cluster containing the original record and the four matches.


3. At the same time, it computes a confidence rating indicating that the likelihood of these records all being for the same child is good, but not high enough to resolve automatically.

4. The matching software sets the cluster aside for manual review.

5. Some time later, a user inspects the records more closely, determines that they are all for the same child, and proceeds with an interactive merging of the data.

The advantage of a back-end approach is that the system can automatically find and resolve large numbers of exact duplicates without any human intervention. This can be very valuable for integrated systems where large numbers of records are coming from multiple sources. The disadvantage of a back-end approach is that, even though some deferred user interaction is possible, the approach cannot easily take advantage of first-hand user knowledge in determining actual matches. When a cluster is formed and set aside for manual review, the user who reviews that cluster will probably not have immediate access to the real person(s) represented by the records in the cluster. Researching the records to make a matching determination can be time-consuming and costly. Some matching software products support both front-end and back-end processes. Since the two approaches have complementary advantages, an integrated health system could benefit from both.
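The back-end flow described above can be sketched in a few lines. The confidence function, thresholds, and record layout here are illustrative stand-ins, not taken from any particular product:

```python
# Sketch of a back-end pass: cluster candidate matches, then auto-resolve
# or queue for manual review based on a confidence score.

def confidence(cluster):
    """Toy confidence rating: fraction of fields shared by every record."""
    fields = cluster[0].keys()
    agreeing = sum(1 for f in fields
                   if len({r[f] for r in cluster}) == 1)
    return agreeing / len(fields)

def process_clusters(clusters, auto_threshold=0.9, review_threshold=0.5):
    auto_resolve, manual_review = [], []
    for cluster in clusters:
        score = confidence(cluster)
        if score >= auto_threshold:
            auto_resolve.append(cluster)      # merge without user input
        elif score >= review_threshold:
            manual_review.append(cluster)     # set aside for a user
        # below review_threshold: treat as a non-match, do nothing
    return auto_resolve, manual_review

joe1 = {"first": "Joe", "last": "Jones", "birth_date": "1999-04-01"}
joe2 = {"first": "Joe", "last": "Jones", "birth_date": "1999-04-02"}
auto, review = process_clusters([[joe1, joe1.copy()], [joe1, joe2]])
# The identical pair is auto-resolved; the near match goes to review.
```

Real products compute confidence from weighted field comparisons rather than simple field equality, but the routing decision (auto-resolve, review, or ignore) follows this shape.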

2.2.2 "How" issues

Matching algorithms come in four basic flavors: single-field comparison, multi-field matching, rule-based matching, and machine learning.

2.2.2.1 Single-field comparison algorithms

Algorithms based on single-field comparisons attempt to find potential matches by quickly comparing individual fields, typically under a user's direction. WinPure (described further in Section 3) is an example of a product that takes this approach. The user simply chooses a field in the record structure, such as phone number, and the system clusters together all the records with similar values for that field. This approach can be fast, but is limited in terms of how it finds meaningful matches.

2.2.2.2 Multi-field matching algorithms

Multi-field matching algorithms can take a wide range of forms. However, they all attempt to find matches by comparing multiple fields from two records and then computing some kind of aggregate matching score that combines the results of the individual field comparisons. Often the algorithm is customizable in terms of which fields it uses, the comparison functions for each field, and how it combines individual field results to form record matching scores. Products with customizable multi-field matching differ in the number of individual comparisons that they support and the kinds of fields that they allow to be compared. For example, PostalSoft only allows matching on up to eight different fields. All but three of these have to come from a pre-defined list typically


consisting of address-book fields. Many fields common in health information systems, such as birth date, are not in the list. Three of the eight fields can be user-defined. Another way in which products differ is in the types of comparisons that they support. Below is a list of some common categories of comparison functions:

Relational: This category includes basic equals, less-than, greater-than, and not-equals comparison functions.

Partial string: This category includes string comparison functions that limit the comparison to a specific number of characters, e.g., just the first five letters of the last name.

Containment: This category includes functions that can determine whether a field value is either fully or partially contained within another.

Ranges: This category includes functions that determine whether a numerical or date field value is within some specified range of another, e.g., the birth date differs by no more than seven days.
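As an illustration, here are minimal versions of two of these categories: a partial-string comparison and a date-range comparison. The field values and cutoffs are made up for the example:

```python
# Illustrative comparison functions: a partial-string compare (first five
# letters) and a date-range compare (birth dates within seven days).
from datetime import date

def partial_match(a, b, n=5):
    """Partial-string comparison: only the first n characters must agree."""
    return a[:n].lower() == b[:n].lower()

def within_days(d1, d2, days=7):
    """Range comparison: true if the two dates differ by at most `days`."""
    return abs((d1 - d2).days) <= days

partial_match("Johnson", "Johnsen")              # first five letters agree
within_days(date(2003, 5, 1), date(2003, 5, 6))  # five days apart
```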

Edit-distance: This category includes functions that determine the minimum number of editing operations (insert a character, delete a character, or replace a character) necessary to make two values the same. Edit distance is a good approximation of the keystroke errors that may have occurred if the two values were supposed to be the same.
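A standard way to compute this edit distance (often called Levenshtein distance) is with dynamic programming. This sketch counts exactly the insert, delete, and replace operations described above:

```python
# Classic two-row dynamic-programming edit distance (Levenshtein).

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # replace (or match)
        prev = cur
    return prev[len(b)]

edit_distance("jonathon", "jonathan")  # one replacement
```

A matching rule might then accept two names whose edit distance is below some small cutoff, on the theory that one or two keystroke errors are likely.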

Soundex matching: Soundex comparisons match strings (typically names) that have different spellings but similar sequences of character sounds. They do this by first removing non-essential characters (all non-initial vowels, H's, Y's, and W's) from the words in the string and then, based on a set of rules, encoding the remaining characters as a sequence of digits. These numerical sequences represent standardized sounds for key letters; they do not represent the pronunciation of the words in the strings. Two strings are then compared through their corresponding Soundex encodings. Robert Russell first proposed the original Soundex idea in 1918, long before electronic information systems. Since then, researchers have proposed many variations of the idea [11]. Today, many database systems provide direct support for information retrieval based on Soundex comparisons.
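For illustration, here is the common American Soundex variant (one of the many variations mentioned above). The grouping of letters into digit codes follows the classic Russell rules:

```python
# American Soundex: keep the first letter, drop vowels/H/W/Y, encode the
# remaining letters as digits, and pad or truncate to a four-character code.

_CODES = {c: d for letters, d in [
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6")] for c in letters}

def soundex(name: str) -> str:
    name = name.upper()
    digits = [_CODES.get(c, "") for c in name]
    code = name[0]
    prev = digits[0]
    for c, d in zip(name[1:], digits[1:]):
        if d and d != prev:       # skip runs of the same code
            code += d
        if c not in "HW":         # H and W do not break a run of like codes
            prev = d
    return (code + "000")[:4]

soundex("Robert"), soundex("Rupert")   # both encode to "R163"
soundex("Smith") == soundex("Smyth")   # variant spellings match
```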


Orthographic comparisons: Unlike Soundex comparisons, these functions attempt to compare words (typically names) based on their pronunciation [4, 9, 10].

Some products also support probabilistic field comparisons that take into account the frequency of the possible field values. The more frequent the value, the weaker the comparison. For example, in comparing first names, the strength of two matching "Michael" values would be considerably less than the strength of two matching "Sylvester" values because "Michael" is more probable and therefore less discriminating. Probabilistic field comparisons can improve the accuracy of a matching algorithm but require more computation, and therefore may be slower. The ways in which multi-field matching algorithms combine the results of individual field comparisons range from simple logical combinations (ANDs and ORs) to weighted sums, where each field comparison counts for a certain percentage of the total match score.

2.2.2.3 Rule-based matching algorithms

Rule-based matching algorithms are similar to multi-field matching algorithms in that they can involve multi-field comparisons and a variety of comparison functions. However, they don't determine a match by combining the individual field comparisons into a single score. Instead, they apply a set of decision rules, i.e., "IF <condition> THEN <action>" statements. The conditions consist of field comparisons, and the actions consist of "match" or "no-match" conclusions. If a rule's condition is true, then its action is taken. Below is a very simple example rule set. In the conditions, the <r>.<field> notation represents a field value, where r is either r1 (a subject record) or r2 (a candidate matching record) and field is the name of a field in the record structure.

1. IF r1.social_security_number = r2.social_security_number THEN match
2. IF SoundexCompare(r1.last_name, r2.last_name) AND
      SoundexCompare(r1.first_name, r2.first_name) AND
      EditDistance(r1.birth_place, r2.birth_place) < 2 AND
      r1.birth_date = r2.birth_date AND
      r1.multiplicity = r2.multiplicity AND
      r1.birth_order = r2.birth_order
   THEN match

The advantage of rule-based matching over multi-field matching is that it can short-circuit the comparison computations by testing the high-confidence or most discriminating rules first. For example, with the above rule set, a rule-based algorithm would test rule #1 first. In doing so, it would compare just the SSNs of the two records. If they are exactly the same, it would declare the records a match and would not continue with the other field comparisons. In many cases, this dramatically speeds up the overall matching time.
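A rule set like the one above could be evaluated along these lines. This is only a sketch: the comparison helper is a simplified stand-in for the SoundexCompare and EditDistance functions, and the record layout is invented:

```python
# Sketch of rule-based matching: rules are tried in order, and the first
# rule whose condition holds decides the outcome, short-circuiting the rest.

def same_prefix(a, b, n=5):
    """Simplified stand-in for a phonetic or fuzzy name comparison."""
    return a[:n].lower() == b[:n].lower()

RULES = [
    # Rule 1: identical SSNs decide the match immediately.
    (lambda r1, r2: r1["ssn"] and r1["ssn"] == r2["ssn"], "match"),
    # Rule 2: similar names plus an identical birth date.
    (lambda r1, r2: same_prefix(r1["last"], r2["last"])
                    and same_prefix(r1["first"], r2["first"])
                    and r1["birth_date"] == r2["birth_date"], "match"),
]

def match(r1, r2, default="no-match"):
    for condition, action in RULES:
        if condition(r1, r2):   # later, costlier rules never run
            return action
    return default

a = {"ssn": "123-45-6789", "first": "Susan", "last": "Smith",
     "birth_date": "2001-02-03"}
b = {"ssn": "", "first": "Susann", "last": "Smithe",
     "birth_date": "2001-02-03"}
match(a, b)   # rule 1 fails (missing SSN), rule 2 fires
```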


2.2.2.4 Machine-learning algorithms

The problem with multi-field comparison and rule-based approaches is that someone has to figure out which fields are most useful in determining matches, how best to compare those fields, and how the results of these comparisons determine (or don't determine) matches. A machine-learning algorithm attempts to solve this problem by allowing the software to customize itself. It does this through a training process in which pairs of records are fed into the system along with their true match/no-match status. For each training pair, the system attempts to compute its own match/no-match result based on its current settings. If it gets the right answer, it reinforces the current settings. If it gets the wrong answer, it tries to figure out what would have helped produce the right answer and alters its settings a little in that direction. By running lots of training data through the system, it can eventually tune its own configuration to correctly compute all the answers. At this point, the system should be able to accurately match other pairs of records not in the training data. The challenge with machine-learning algorithms is in creating a training set that represents all the problematic variations in the real data and will enable the algorithm to converge on a stable configuration.
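The training loop just described can be illustrated with the simplest possible learner: a perceptron over yes/no field comparisons. Real products use far more sophisticated techniques, and the records and labels here are fabricated purely for the sketch:

```python
# Minimal sketch of the training process: each record pair is reduced to
# yes/no field comparisons, and a perceptron nudges its weights whenever
# it predicts the wrong match/no-match answer.

def features(r1, r2):
    return [float(r1["last"] == r2["last"]),
            float(r1["first"] == r2["first"]),
            float(r1["birth_date"] == r2["birth_date"])]

def train(pairs, epochs=20, rate=0.5):
    w, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for r1, r2, is_match in pairs:
            x = features(r1, r2)
            predicted = sum(wi * xi for wi, xi in zip(w, x)) + bias > 0
            if predicted != is_match:           # wrong: adjust toward truth
                sign = 1.0 if is_match else -1.0
                w = [wi + rate * sign * xi for wi, xi in zip(w, x)]
                bias += rate * sign
    return w, bias

def predict(w, bias, r1, r2):
    x = features(r1, r2)
    return sum(wi * xi for wi, xi in zip(w, x)) + bias > 0

def rec(last, first, birth_date):
    return {"last": last, "first": first, "birth_date": birth_date}

training = [
    (rec("Smith", "Sue", "2001"), rec("Smith", "Sue", "2001"), True),
    (rec("Smith", "Sue", "2001"), rec("Smith", "Susan", "2001"), True),
    (rec("Smith", "Sue", "2001"), rec("Jones", "Sue", "1999"), False),
    (rec("Smith", "Sue", "2001"), rec("Smith", "Sue", "1999"), False),
]
w, bias = train(training)
predict(w, bias, rec("Smith", "S.", "2001"), rec("Smith", "Sue", "2001"))
```

The learned weights end up playing the same role as the hand-tuned weights in a multi-field matching algorithm; the training data has simply chosen them automatically.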

2.2.3 "What" issues

Theoretically, matching algorithms can match records based on any piece of available data. However, in practice, off-the-shelf products often make assumptions about what information is available and what will be the most discriminating. For integrated child-health information systems, the key is whether the product can be configured or adapted to use fields that are not common in other person-centric files. Some good discriminators for child-health information systems include birth date, birth multiplicity, birth order, and mother's maiden name.

2.3 Record Coalescing (Linking or Merging)

Once a system has found some matching records and organized them into groups or clusters, the next step is to remove the duplicate data. We call this process record coalescing. For front-end matching systems, record coalescing can occur as an integral part of the matching process and will typically deal with just one cluster at a time. For back-end systems, the system may attempt to do some record coalescing (for high-confidence matches) immediately, or it may defer this process until later. In general, record coalescing can be accomplished by doing one of the following:

1. Deleting all but one of the records in a cluster

2. Merging the data from all the records in a cluster into one record

3. Linking together all the records in a cluster so that if one of them is retrieved, the others can also be easily retrieved if necessary

The first option is not realistic for health information systems since it could result in the loss of valuable information. Unfortunately, it is sometimes the only option supported by low-end products. The choice between the second and third options depends largely on


the design of the information system and on external constraints. In some cases, there are restrictions against modifying patient records. For example, in Maine, changes cannot be made to Medicaid address data, except through the Medicaid system by authorized Medicaid personnel. In this case, the only choice is to logically link the matching records together. If merging is possible, then it is often a cleaner choice because it eliminates redundant data and thereby avoids confusion. Merging, however, is not a trivial matter. Some of the problems include:

1. Standardizing the data-item values, which involves many of the same issues raised in Section 2.1 but on a broader scale.

2. Resolving data-item conflicts, which requires answering questions like "which source of the data is more authoritative?", "can missing information (a null value) overwrite an existing value?", and "should all known values for a field be kept?"

3. Determining when in the overall process merges take place.

4. Determining who will be responsible for merging and resolving records.

These issues can be very complex and need to be addressed in the context of a specific integrated system.
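To make the first two issues concrete, here is a minimal sketch of one possible merge policy: rank sources by authority and never let a null overwrite a known value. The source names and ranking are invented for the example:

```python
# Sketch of data-item conflict resolution during a merge: more authoritative
# sources win field by field, but missing values (None) never overwrite
# known ones.

AUTHORITY = {"vital_records": 3, "medicaid": 2, "clinic": 1}

def merge(records):
    """Merge per-source records into one consolidated record."""
    ranked = sorted(records, key=lambda r: AUTHORITY[r["source"]])
    merged = {}
    for r in ranked:                 # later (more authoritative) records win
        for field, value in r.items():
            if field == "source":
                continue
            if value is not None:    # ...but nulls never overwrite values
                merged[field] = value
    return merged

merge([
    {"source": "clinic", "first": "Sue", "address": "12 Oak St"},
    {"source": "vital_records", "first": "Susan", "address": None},
])
# The name comes from vital records; the address survives from the clinic.
```

An actual integrated system would also have to honor external constraints, such as the Maine restriction on modifying Medicaid address data mentioned above, which this sketch ignores.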

2.4 Integration Classifications

The number of different ways in which de-duplication technology is packaged and integrated into information systems is almost as large as the number of individual products. However, in an attempt to characterize the technology and classify the products, we can break them into three general categories: stand-alone systems, software development kits, and server-based systems.

2.4.1 Stand-alone systems

With stand-alone systems, there is no program coupling between the information system and the de-duplication software, except for the transfer of records between the two. A user typically has to manually:

1. Export all of the records from the information system

2. Import them into the de-duplication system

3. Perform the de-duplication activities

4. Export all the records from the de-duplication system

5. Import them back into the information system

Obviously for large systems, like integrated child-health information systems, this is not practical. In some cases, the de-duplication software might be able to automate steps 1, 2, 4, and 5 but only if it can directly read the information system�s database. Still, these steps would take considerable time for large databases and the information system may have to be off-line during the whole process in order to avoid synchronization problems.


2.4.2 Software Development Kits

Software Development Kits (SDKs) are libraries of re-usable de-duplication software components. Information-system programmers can use these software components to integrate de-duplication functionality directly into the systems that they build. SDKs offer programmers a high degree of flexibility, since the programmers are in control of how and where the de-duplication occurs in the information system. However, using an SDK creates a significant dependency between the information system and an outside product. If the SDK changes (sometimes in very modest ways), then the information system will likely also have to change. Among software developers, this is referred to as high coupling and is typically considered undesirable.

2.4.3 Server-based systems

Like SDKs, server-based products allow information systems to access de-duplication features directly in code. However, the de-duplication software is logically separate from the information system and typically runs as an independent process called a "server". A server provides access to "services" like address cleaning, record matching, or merging via well-defined programming interfaces. As long as these interfaces don't change, updates to the de-duplication software will not cause changes to the information system. So, server-based approaches offer a high degree of flexibility like SDKs, but without the high coupling. Server-based approaches can also lead to other benefits, including improved scalability and better performance through the use of concurrency. Since the de-duplication software runs in a separate process, it can live on a different computer, allowing computational resources to grow more incrementally. Theoretically, programmers could also replicate the de-duplication server on multiple machines. This would allow the information system to execute concurrent de-duplication operations and thereby improve overall performance.
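The idea of a well-defined service interface can be sketched as follows. The service names and signatures are invented; the point is that the information system codes against the interface, so the server implementation behind it can change freely:

```python
# Sketch of a de-duplication service interface. A real server would run
# out-of-process; this in-memory class is just a stand-in implementation.
from abc import ABC, abstractmethod

class DedupService(ABC):
    """The stable interface the information system depends on."""

    @abstractmethod
    def clean_address(self, address: str) -> str: ...

    @abstractmethod
    def find_matches(self, record: dict) -> list: ...

class InMemoryDedupService(DedupService):
    def __init__(self, records):
        self.records = records

    def clean_address(self, address: str) -> str:
        # Trivial "cleaning": normalize whitespace and case.
        return " ".join(address.upper().split())

    def find_matches(self, record: dict) -> list:
        return [r for r in self.records
                if r["last"] == record["last"]
                and r["first"] == record["first"]]

svc = InMemoryDedupService([{"first": "Sue", "last": "Smith"}])
svc.clean_address("  12 oak   st ")
svc.find_matches({"first": "Sue", "last": "Smith"})
```

Because callers only ever hold a DedupService reference, swapping the in-memory stand-in for a networked server (or a replicated pool of servers) would not require changes to the information system.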


3 Software Products

Many products that address the de-duplication process are already commercially available. Some of these simply deal with a single part of the process, while others deal with most or all of it. This section summarizes an evaluation of a sampling of these products. Although it is not exhaustive, it provides some insights into the state of the art for commercial de-duplication products and the challenges associated with evaluating such products.

• Section 3.1 lists the evaluation candidates and categorizes them according to which part(s) of the de-duplication process they address.

• Section 3.2 describes a first-pass evaluation that looked at eight different products using criteria that can be tested without actually running the software. We'll refer to this evaluation as the off-line evaluation.

• Section 3.3 discusses a more in-depth evaluation that tests a product against a known data set. We'll refer to this kind of evaluation as benchmark evaluation. Because of the limited availability of actual software, the team was able to conduct a benchmark evaluation for only one product.

• Section 3.4 discusses this and other challenges associated with reviewing and selecting de-duplication products.

3.1 Products and Their Classification

Table 3.1 identifies a wide range of de-duplication products that could be candidates for the off-line and benchmark evaluations. However, it is not an exhaustive list. For example, the US Postal Service alone has certified over 1170 address-cleaning products [16], and this is just the tip of the de-duplication software domain. Furthermore, an exhaustive list would not be of much long-term value since products are always coming and going. The real value of Table 3.1 is in illustrating how to begin an evaluation by identifying and classifying candidate products. The first column of Table 3.1 contains the product names and vendor information. The second column provides brief descriptions of the products. The 3rd, 4th, and 5th columns indicate whether each product deals with data-item transformation (e.g., address cleaning), matching, and/or merging, respectively. The column labeled Integration Type indicates whether each product is packaged as a stand-alone software system, a software development kit (SDK), or a server-based system (see Section 2.4). The last two columns indicate whether the project team could obtain sufficient technical documentation for off-line evaluation and a copy of the actual software for a benchmark evaluation. For this project, the team explored the Internet, studied existing health-care information systems, and looked at people-centric database systems, such as genealogical systems, to uncover what products were currently in use and to get ideas for classifying them. Note


that some products that have come up in conversations with various Connections members are not in the table because the project team could not find any information about them.


Table 3.1 - Candidate Products

Product Name, Vendor | Description | Data-item Trans. | Matching | Merging | Integration Type | Tech. Info. Available | Eval. Software Available

Ab Initio, Ab Initio | High-performance software library and graphical environment for data transformation. | Y | N | N | Stand-alone | Y | N
AMADEA, ISOFT | Data extraction, transformation, and real-time reporting software. | Y | N | N | Stand-alone | N | N
iManageData(tm), BioComp Systems Inc. | Accesses, cleans, filters, converts and transforms data from text files, Excel, Oracle databases, SQL Server databases, and more. | Y | Y | Y | Stand-alone or SDK | Y | Y
Centrus Merge/Purge Library, Qualitative Marketing Software | Cleans customer information and identifies duplicate records. | Y | Y | Y | Stand-alone | N | N
ChoiceMaker 2.2, ChoiceMaker | Data quality and database record matching, merging, and de-duplication software based on patented AI and machine-learning techniques. | N | Y | Y | Server-based or Stand-alone | Y | N
Data Manager, GGMate | Visual Basic GUI application for data transformation for Win95/Win98. | Y | Y | Y | Stand-alone | Y | Y
DataSet V, Intercon Systems, Inc. | Matching, dedup'ing, retrieving & mining suite. | Y | Y | Y | Stand-alone or SDK | N | Y
Dataskope, LifeCycle Software | Department-level tools to map, transform, alarm, output and view high volumes of binary or ASCII input data. | Y | N | N | Stand-alone | Y/N | Y
Data Tools Twins, Data Tools | Cleans addresses in databases that support ODBC connections; provides pre-defined matching algorithms. | Y | N | N | Server-based | N | N
DfPower, DataFlux Corporation | Point-and-click driven. Analyzes data, standardizes data fields, de-duplicates, and has built-in database connectivity. | Y | Y | Y | Stand-alone | Y/N | Y
DoubleTake, StyleList, Personator, Peoplesmith | Splits names, addresses, and city, state and zip codes. Several match codes supported, several matching criteria used simultaneously, ODBC access. | N | Y | Y | Stand-alone | Y/N | Y
DQ Now, DQ Now | Profiling, cleansing, and dedup tools, providing a clear view of the data. | Y | Y | Y | Stand-alone | Y/N | Y
GritBot, RuleQuest Research | Identifies anomalies in data (compatible with See5 and Cubist). | Y | N | N | SDK | N | Y
Hummingbird ETL, Hummingbird | Data integration solution. | Y | Y | Y | Stand-alone | Y | Y
PostalSoft, firstLogic | Cleans customer-related information. Cleaning process is composed of six steps: parsing, correction, standardization, data enhancement, matching, consolidating. | Y | Y | Y | Stand-alone | Y/N | Y
Integrity, Vality | Solves data problems common when re-engineering legacy systems; data mining, data typing, entity identification investigation, standardization, matching and survivorship. | Y | Y | Y | Stand-alone | Y/N | N
LinkSolv, LinkSolv | De-duplication software. | N | Y | Y | Stand-alone | N | N
matchIT, HelpIT Systems Limited | Point-and-click interface; users specify match keys. Allows user to define fields to compare and their importance. | Y | Y | Y | Stand-alone | Y/N | Y/N
Merge/Purge Plus, Group1 Software | Cleans names and addresses. Fixed set of supported matching options. Applies to multiple files. | Y | Y | Y | Stand-alone, ToolKit | N | N
NoDupes, Quess, Inc. | Cleans specific-domain data related to individuals, companies and products; user level of matching; customizable. | N | Y | Y | Stand-alone | Y/N | Y
SSA-Name/Data Clustering Engine, Search Software America | Claims to solve many data problems; cleans customer non-formatted related information; generates multiple keys and stores them in a database index; permits iterative tuning. | N | Y | Y | SDK | Y/N | Y
Syncsort, Syncsort | Fast high-volume sorting, filtering, reformatting, aggregating, and more. | Y | N | N | Stand-alone | Y/N | Y
SureCleanse, DQ Global | Improves data accuracy by ensuring underlying databases are duplicate-free. | Y | Y | Y | SDK, Stand-alone | Y/N | Y
TwinFinder, Omikron | Cleans names and addresses from doublettes. Uses the lingual/mathematical FACT algorithm for fuzzy pattern-matching. | Y | Y | Y | Stand-alone | Y | Y
WinPure Pro, WinPure | Powerful data-cleaning software, including duplication removal, email suggestions, statistics and more. | N | Y | Y | Stand-alone | Y/N | Y

3.2 Off-line Evaluation

Since resources for this project were limited, the off-line evaluation could only cover a relatively small number of products. Based on comments from the Connections group, the team established the following guidelines for prioritizing the products and selecting eight of them for the off-line evaluation. Table 3.2 lists those that were selected.

• Products that support matching and merging took precedence over products that supported just data-item transformation.

Rationale: The focus of this study is on de-duplication in integrated health systems and in these kinds of systems, matching and merging are the tougher problems.

• Server-based or SDK products took precedence over stand-alone products.

Rationale: Stand-alone products are difficult (if not impossible) to incorporate into integrated systems because by definition they do not provide an electronic interface for submitting de-duplication requests. See Section 2.4. Server-based systems or SDKs do, and therefore, have at least some potential for being incorporated into an integrated system.

• Products with insufficient technical documentation would not be considered for the off-line evaluation.

Rationale: There is no sense in evaluating a product if technical documentation is not available.

The availability of technical documentation ended up dominating the selection process. Because the server-based and SDK products are typically more flexible and require more custom configuration, the technical documentation that was available up-front for these products was less specific. As a result, the list of selected products contained more stand-alone systems than originally desired. The off-line evaluation involved studying product information and technical documentation for each of the selected products and comparing them in the following areas: cost, platform, existing applications, matching technology, merging technology, and product support. Sections 3.2.1 through 3.2.6 summarize the findings in each of these areas. Appendix A contains the complete set of data for the off-line evaluation.

Table 3.2 - Products selected for the off-line evaluation

ChoiceMaker 2.2, ChoiceMaker
DataSet V, Intercon Systems, Inc.
DfPower, DataFlux Corporation
PostalSoft, firstLogic
Merge/Purge Plus, Group1 Software
DoubleTake, StyleList, and Personator, Peoplesmith
SSA-Name/Data Clustering Engine, Search Software America
WinPure, WinPure, Ltd.


3.2.1 Cost

Since the products in Table 3.1, and even those in Table 3.2, vary considerably in their categorization and sophistication, it's difficult to compare prices directly. However, one way to form a basis for comparison is to separate the costs into three areas: up-front, recurring, and indirect costs. The up-front costs include one-time purchasing or licensing costs. Recurring costs include any periodic maintenance or update fees, per-use charges, or subscription fees. The indirect costs include programming, setup, training, and other miscellaneous costs. Indirect costs can be one-time, periodic, or ongoing. Because of all the variables that can affect them, they cannot be expressed here simply in terms of dollars. Table 3.3 summarizes cost information with respect to these three areas for each of the eight products. Note that even with this breakdown the costs cannot be compared directly. For example, some have licensing fees based on the size or type of the organization (ChoiceMaker, DfPower, and Merge/Purge Plus); others are based on the number of users (DataSet V, WinPure); and still others take into account the size or type of computer that will host the software (SSA-Name/Data Clustering Engine). The best way to compare costs is to do it in the context of a specific application. This will allow the evaluator to fix certain variables like size of organization, number of records, and type and size of the host machine.


Table 3.3 - Product Costs

Product | Up-front Costs | Recurring Costs | Indirect Costs (setup, training, programming, etc.)

ChoiceMaker 2.2 | $25K-$250K licensing fee | Subsequent annual maintenance fee equal to 18%-20% of licensing fee | Could take anywhere from a few days to a few months of a programmer's time to incorporate this product into an integrated health information system. Also, the ChoiceMaker system needs to be trained using the local data. Developing this training data and doing the training can take considerable time.
DataSet V | $1500 for 2-10 users; $1000 for each additional user | Not specified | Could take considerable time to incorporate into an integrated health information system, if even possible. Because the product is stand-alone, there would be an on-going indirect cost for de-duplication activity.
DfPower | $25K-$500K licensing fee | Not specified | Vendor provides consulting and training, but costs will vary depending on the application.
PostalSoft | $15K | Subsequent annual maintenance fee equal to 15% of licensing fees | Considerable indirect costs for configuring the software, importing/exporting the data, customizing the results, etc.
Merge/Purge Plus | $20K-$250K licensing fee | Subsequent annual maintenance fee equal to 15%-20% of licensing fee | Indirect costs include programming time to incorporate the duplication tools into an integrated health information system.
DoubleTake, StyleList, and Personator | $3995 | Subsequent annual maintenance fee equal to 15%-20% of licensing fee | Because the product is stand-alone, there would be a huge on-going indirect cost for periodically importing/exporting data.
SSA-Name/Data Clustering Engine | $66,000 licensing fee for 1 Intel-class CPU | Subsequent annual maintenance fee equal to 15% of licensing fees | Because the product is stand-alone, there would be a huge on-going indirect cost for periodically importing/exporting data.
WinPure | $149 per user | None | Because the product is stand-alone, there would be a huge on-going indirect cost for periodically importing/exporting data.


3.2.2 Supported Platforms

Table 3.1 classifies the products in terms of their Integration Type, which could be stand-alone, server-based, or SDK. (See Section 2.4 for descriptions of the three general categories.) Although this classification gives a broad view of a product's potential for being incorporated into an integrated system, it by no means tells the whole story. Another critical question related to integration potential is, what platforms do the products support? Informally, a platform is any computing environment defined by hardware specifications, an operating system, communication software, and any other prerequisite software (virtual machines, databases, etc.). Many of the products support a variety of platforms, while others are tied to a specific one. They also vary with respect to the degree of their dependency on a platform. Those that are heavily tied to a specific platform may be slower to take advantage of advances in hardware, new operating systems and databases, etc. Table 3.4 summarizes platform support findings for the eight selected products.

Table 3.4 - Platform Requirements or Restrictions

Product | Hardware | Operating Systems | Communication Software | Other Software
ChoiceMaker 2.2 | No special requirements | None - can run on anything that supports a JVM | Unknown | Strong dependency on the Java Virtual Machine (JVM). However, a reasonable implementation of the JVM exists for almost every type of operating system.
DataSet V | No special requirements | Windows | None | None
DfPower | No special requirements | Linux, Unix, Windows | None | None
PostalSoft | No special requirements | Linux, Unix, Windows 95 or higher | None | No additional software required, except that all the CDs given with the product should be copied onto the system.
Merge/Purge Plus | No special requirements | Windows 95 or higher | Unknown | Unknown
DoubleTake, Stylelist, and Personator | No special requirements | Windows 95 or higher | Unknown | Unknown
SSA-Name/Data Clustering Engine | No special requirements | Linux, Unix, Windows XP, NT | None | Unknown
WinPure | No special requirements | Strong dependency on Windows, but supports any version of Windows | Not applicable since the product is completely stand-alone | None

3.2.3 Existing Applications

Table 3.5 summarizes known uses for each of the products in three domains: immunization registries, other health information systems, and other people-centric database systems in general. The broader a product's existing use, the more likely it can be adapted to new situations.

Table 3.5 - Existing Applications

Product | Immunization Registries | Other Health-Care Information Systems | Other People-centric Database Systems
ChoiceMaker 2.2 | Master Client Index (MCI), New York City | Master Client Index (MCI), New York City | Systems with the US Government and various business databases
DataSet V | Yes, but specifics unknown | Yes, but specifics unknown | Unknown
DfPower | Yes, but specifics unknown | Unknown | Direct connection to over 30 databases with read and write capability
PostalSoft | Unknown | BlueCross, BlueShield, Pfizer, Sierra HC, American Medical, LifeScan, Health Network, DissellHC, Private HCS, Delta Dental | Yes, but specifics unknown
Merge/Purge Plus | None | Pfizer, FirstHealth Group | Banking; credit card systems; other kinds of financial systems; real estate systems; phone and other utility systems; and retail systems
DoubleTake, Stylelist, and Personator | None | Unknown | Financial systems of all kinds
SSA-Name/Data Clustering Engine | California, Texas, and Florida | A wide range in all areas of health care | Systems for government, law enforcement, education, finance, insurance, credit, retail, auto, and telecom
WinPure | Unknown | Unknown | Unknown

3.2.4 Matching Technology

The off-line evaluation looked at matching technology, at searching and user-interaction approaches, and at the degree of customization supported in both of these areas. The effectiveness of a de-duplication process in a health information system depends heavily on the ability to find potential matches for a given child. If the de-duplication software's search method were too broad, it would return too many possible matches for a given child. This would place an extra burden on the user to further narrow down the matches and would raise the on-going costs of de-duplication. If the search method were too narrow, it would miss valid matches and render the de-duplication process ineffective. In both cases, the user might overlook existing duplicates and add new ones to the system. Part of any searching process involves comparing individual data items. Some of the products, such as WinPure, take a very simplistic approach based on single-field comparisons and a couple of basic comparison functions. Other products, like ChoiceMaker, use sophisticated machine-learning approaches that can automatically adjust individual data-item comparisons (or more precisely, their relative importance) to a specific locale or information system. Table 3.6 summarizes the searching and comparing techniques used by each of the selected products. In addition to describing each product's approach to searching and comparing, it ranks their robustness and customizability as high, medium, or low. High robustness means that the product can support sophisticated search and comparison rules, e.g., multiple data items and fuzzy comparisons. High customizability means that a system integrator or programmer can tune the product to better fit the unique characteristics of a given information system.
Note that some products, like WinPure, offer some flexibility in how searching is done, but place the burden of using different searching techniques on the end user. For this evaluation, such flexibility is not considered customizability, since the tool is not being configured for more efficient use in the future.
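Most of the fuzzy comparison techniques mentioned in this section and in the tables that follow boil down to functions like edit distance and Soundex. The sketch below is purely illustrative, not code from any of the evaluated products:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def soundex(name: str) -> str:
    """Classic 4-character Soundex code (initial letter + 3 digits)."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, last = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "HW":      # H and W do not reset the previous code
            last = code
    return (result + "000")[:4]

# Two records that differ by a one-character typo in the first name:
print(edit_distance("Katherine", "Katharine"))          # 1
print(soundex("Katherine") == soundex("Katharine"))     # True
```

A 1-character edit distance or an equal Soundex code is exactly the kind of near-miss that the first-name and last-name spelling categories in the CDC test data are designed to exercise.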

Table 3.6 - Searching and Comparison

Product | Searching and Comparison | Robustness | Customizability
ChoiceMaker 2.2 | Machine learning and probabilistic approach. They use string-matching algorithms and many proprietary algorithms to maintain confidentiality. | High | High
DataSet V | Claims proprietary methods, basically comparing every record with every other record in the database. | Low | Low
DfPower | The matching is probabilistic and machine learning. It also includes deterministic approaches. | High | High
PostalSoft | Finds matches based on the standard and user-specified fields. The user can set his thresholds for comparison. It is a rule-based approach. | Medium | Medium
Merge/Purge Plus | Finds matches based on names and addresses only. | Low | Low
DoubleTake, Stylelist, and Personator | Finds matches using a rule-based approach that reads data directly from a large variety of databases. | High | High
SSA-Name/Data Clustering Engine | Uses a rule-based approach. Includes pre-packaged search and matching rules that work well for most populations, but allows custom rules to override the pre-packaged rules. | High | High
WinPure | Finds matches based on any single column and a user-selected comparison method. It appears to support a limited number of basic compare functions, such as 1-character edit distance, although they seem to be very limited and inflexible in searching. | Low | Medium

In general, the more robust and adaptable a searching method is, the better its chances are of finding the best set of possible matches for a given child. After a searching operation returns a list of possible matches, something has to be done with that information. Two choices exist: a) the de-duplication software can allow a user to interactively identify the actual matches or b) the system can try to do that automatically. Table 3.7 summarizes the approach used by each of the selected products and indicates its level of customizability. Which approach is best depends on how an information system intends to integrate the de-duplication software. If integration is on the front-end, then there are distinct advantages to allowing the user to assist with match refinement activity. If the de-duplication is on the back-end, then allowing user interaction is not feasible or at least not immediately so.
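One common way to connect a searching method to either interactive or automatic match handling is to score each candidate pair and apply two thresholds: pairs above the upper threshold are treated as certain matches, pairs below the lower threshold are rejected, and pairs in between are queued for human review. The field weights and thresholds below are invented for illustration; each evaluated product uses its own scoring scheme.

```python
# Hypothetical field weights; real systems tune these per locale and data source.
WEIGHTS = {"first_name": 0.3, "last_name": 0.3, "dob": 0.3, "gender": 0.1}
UPPER, LOWER = 0.9, 0.5   # illustrative thresholds

def score(a: dict, b: dict) -> float:
    """Weighted fraction of fields that agree between two records."""
    return sum(w for f, w in WEIGHTS.items() if a.get(f) == b.get(f))

def classify(a: dict, b: dict) -> str:
    s = score(a, b)
    if s >= UPPER:
        return "certain duplicate"      # safe to handle automatically
    if s >= LOWER:
        return "potential duplicate"    # route to a user for review
    return "non-duplicate"              # leave both records alone

rec1 = {"first_name": "ANA",  "last_name": "GARCIA", "dob": "2003-08-01", "gender": "F"}
rec2 = {"first_name": "ANNA", "last_name": "GARCIA", "dob": "2003-08-01", "gender": "F"}
print(classify(rec1, rec2))   # potential duplicate
```

Raising the upper threshold shifts more work to human review; lowering it risks merging records that are not actually duplicates.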

Table 3.7 - User Interaction for Match Refinement (Clustering)

Product | Supports User Interaction | Customizability
ChoiceMaker 2.2 | Yes | High
DataSet V | No | Medium
DfPower | Yes | High
PostalSoft | Yes | Low
Merge/Purge Plus | Yes | Medium
DoubleTake, Stylelist, and Personator | Yes | Medium
SSA-Name/Data Clustering Engine | No | Medium
WinPure | Yes | Low

3.2.5 Merging Technology

The off-line evaluation looked at merging technology in three general areas: duplicate removal, data conflict resolution, and user interaction. It also looked at the degree of customization supported in each of these areas. Duplicate removal deals with how the software coalesces redundant data. In general, there are three basic approaches: deleting duplicate records without merging values, linking matching records so that the information system can find all related records, and merging data from all matching records into one complete record. The 2nd column of Table 3.8 indicates which of these approaches each product supports. If the software supports merging matching records, then it must also deal with potential data conflicts. For example, consider a situation where a child has two records in the system, one with a birth date of 8/1/2003 and another with a birth date of 8/4/2003. Which one is correct? There are three basic approaches to resolving such data conflicts: the system gives precedence to one source of information over another (source-based precedence), lets the user choose which value to keep (user-directed), or simply keeps both values (data stacking). The 3rd column of Table 3.8 summarizes how the products deal with data conflicts, if applicable. The 4th and 5th columns of Table 3.8 describe the level of user interaction and customizability supported by each product for merging activities.
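Source-based precedence and data stacking can be sketched as follows. The record layout and the source rankings are hypothetical, and user-directed resolution is omitted because it simply replaces the rule with a prompt to the user:

```python
# Hypothetical source ranking: lower number = more trusted source.
SOURCE_RANK = {"vital_records": 0, "provider": 1, "batch_import": 2}

def source_precedence(a: dict, b: dict) -> dict:
    """Keep every field from the record with the more trusted source."""
    winner = min(a, b, key=lambda r: SOURCE_RANK[r["source"]])
    return dict(winner)

def data_stacking(a: dict, b: dict) -> dict:
    """Keep both values whenever the two records disagree."""
    merged = {}
    for field in set(a) | set(b):
        values = {a.get(field), b.get(field)} - {None}
        merged[field] = values.pop() if len(values) == 1 else sorted(values)
    return merged

a = {"source": "vital_records", "dob": "2003-08-01", "last_name": "LEE"}
b = {"source": "batch_import",  "dob": "2003-08-04", "last_name": "LEE"}
print(source_precedence(a, b)["dob"])   # 2003-08-01 (trusted source wins)
print(data_stacking(a, b)["dob"])       # ['2003-08-01', '2003-08-04'] (both kept)
```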

Table 3.8 - Summary of Merging Technology

Product | Duplicate Removal | Data Conflict Resolution | User Interaction | Customizability
ChoiceMaker 2.2 | Supports merging and deleting duplicate records | Source-based precedence and data stacking | None | High
DataSet V | Supports merging and deleting | Supports data stacking | User can merge questionable matches; others are handled automatically | Low
DfPower | Supports deleting, linking, and merging of matching records | Source-based precedence and data stacking | High | Medium
PostalSoft | Supports merging and deleting | User-directed | User performs manual merges and deletes, although there are some tools for selecting multiple records at a time | Low
Merge/Purge Plus | Suppression, or purge, files can be used to eliminate unwanted records | Not supported | Low | Low
DoubleTake, Stylelist, and Personator | Merge files with different field structures directly | Supports data stacking | Medium | Low
SSA-Name/Data Clustering Engine | None - only reports potential matches | Not supported | None | Not applicable
WinPure | Deleting duplicate records without merging data | Not supported | User chooses which of the matching records to delete | Low

3.2.6 Product Support

The products vary widely in the type and level of support that they offer. Table 3.9 summarizes the available support in four areas: on-line, telephone, training, and consulting.


3.3 Benchmark Evaluation

The benchmark evaluation sought to compare the setup costs, accuracy, and speed of the selected de-duplication products. Unfortunately, the project team could only obtain an evaluation copy of FirstLogic's PostalSoft, so a comparative analysis of multiple benchmark evaluations is impossible here. Nevertheless, we believe this single evaluation still provides significant value because it illustrates a relatively simple and systematic method for conducting a benchmark evaluation. The method consists of four basic steps:

1. Establish evaluation criteria and test techniques
2. Set up and learn the software product
3. Measure the product against the evaluation criteria
4. Compile and interpret the results

Sections 3.3.1-3.3.4 describe these steps and, where appropriate, show results from the PostalSoft evaluation.

Table 3.9 - Summary of Product Support

Product | On-line | Telephone | Training | Consulting
ChoiceMaker 2.2 | Free | Fee-based | Fee-based | Fee-based
DataSet V | Free | Free* | None | None
DfPower | Free | Fee-based | None | None
PostalSoft | Free | Free | None | None
Merge/Purge Plus | Free | Fee-based | Fee-based | None
DoubleTake, Stylelist, and Personator | Free | Free | None | None
SSA-Name/Data Clustering Engine | Free | Fee-based | Fee-based | None
WinPure | Free | Fee-based | None | None

* Limited services provided for free

3.3.1 Step 1 - Benchmark Evaluation Criteria and Testing Techniques

The first, but most often forgotten, step of any product evaluation is to establish a set of criteria against which the product will be tested. By doing this ahead of time, the evaluation can remain focused and unbiased. Without a pre-determined set of criteria, testers will have a tendency to report on what the product does and not on what it should do. Of course, the challenge is to come up with meaningful, discriminating criteria. Doing this well requires a preliminary exploration of similar products to become better acquainted with background concepts and to gather ideas about expected or interesting features. With this background information, a tester organizes and writes evaluation criteria, which are measurable conditions that discriminate whether a product meets expectations. For each criterion, a tester should also describe one or more techniques or test cases for evaluating the product against that criterion.

Table 3.10 lists the criteria and testing techniques for this benchmark evaluation, organized by setup, accuracy, and performance. Since the nature and complexity of the setup process can vary considerably depending on the product's integration type, the criteria in this area try to ascertain whether there is sufficient documentation for installation, configuration (or training), and integration with an integrated information system. The testing techniques for these criteria require subjective judgments based on observations made during the installation, configuration, and integration processes.

The criteria for the accuracy area consider whether the product can correctly identify duplicates in the presence of common data errors without mistakenly matching any records that are not actually duplicates. The testing techniques for these criteria prescribe using the CDC de-duplication testing toolkit [5]. This toolkit includes a test data set of 550 records, 251 of which are duplicates, and tools for analyzing the results. The duplicate records differ from their matches in ways commonly found in immunization registries (see Table 3.11). More about this toolkit is available on the CDC website [http://www.cdc.gov]. The test data also contains pairs of records that are similar but should not be considered duplicates. They are used to check whether the product incorrectly matches records that shouldn't be matched. Table 3.12 describes the different categories of similarity and shows the number of records in each one.

The criteria and testing techniques for the execution-speed area try to determine if the product will perform well for an average-sized integrated child health information system.
Among the projects summarized in Section 4, the annual birth cohorts range from 6,000 to 125,000, with an average of 46,000. If an integrated child health information system kept records on-line for 10 years, then it would be reasonable to expect the database to contain 500,000 records at any point in time.
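At that scale, comparing every incoming record against all 500,000 stored records is what makes execution speed a concern; matching engines typically narrow the search with a blocking step, so that only records sharing a cheap key are compared in detail. A minimal sketch follows; the particular blocking key is an invented example, not one used by any product in this report.

```python
from collections import defaultdict

def blocking_key(rec: dict) -> str:
    """Cheap key: first letter of last name + birth year.
    Only records sharing a key are ever compared in detail."""
    return rec["last_name"][:1].upper() + rec["dob"][:4]

def build_index(records):
    index = defaultdict(list)
    for rec in records:
        index[blocking_key(rec)].append(rec)
    return index

def candidates(index, rec):
    """Candidate matches for one incoming record: a small block,
    not the whole database."""
    return index[blocking_key(rec)]

db = [
    {"last_name": "NGUYEN", "dob": "2003-05-12"},
    {"last_name": "NELSON", "dob": "2003-01-30"},
    {"last_name": "ORTIZ",  "dob": "2003-05-12"},
]
index = build_index(db)
incoming = {"last_name": "Nguyen", "dob": "2003-05-12"}
print(len(candidates(index, incoming)))  # 2, not 3
```

A too-coarse key recreates the full scan; a too-fine key (e.g., exact name plus exact birth date) reintroduces the missed-match problem discussed in Section 3.2.4.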

Table 3.10 - Benchmark Evaluation Criteria

Area: Setup

Criterion: The product should come with detailed instructions for installation and configuration (or training).
Testing Technique: Study the instructions and do the installation and configuration (or training). Keep notes during the process about the kinds of problems that arise. Make a subjective assessment of the product with respect to this criterion.
Measurement: Rate on a scale of 0 to 4, with 4 being the highest and meaning that all expectations with respect to this criterion were met.

Criterion: If the product is stand-alone, it should provide sufficient documentation for setting up the importing and exporting of the data; otherwise, it should provide sufficient documentation for interfacing it with an information system.
Testing Technique: Study the appropriate documentation and set it up for either importing data or accepting data from a test information system.
Measurement: Rate on a scale of 0 to 4, with 4 being the highest and meaning that all expectations with respect to this criterion were met.

Criterion: The configuration or training should be saved so it doesn't have to be redone every time the product is used or updated.
Testing Technique: Test to see if the configuration is persistent. Run multiple times and note if any part of the setup process has to be redone. Also, reinstall the product and note how much of the configuration is remembered.
Measurement: Rate on a scale of 0 to 4, with 4 being the highest and meaning that all expectations with respect to this criterion were met.

Area: Accuracy

Criterion: The product should correctly identify 90% or more of the duplicate records in a test data set with known duplicates that represent common data problems.
Testing Technique: Using the CDC test data set [5], run the matching software and generate a list of what it determines to be certain duplicates (can be automatically merged or rejected) and potential duplicates (those that need human resolution). Analyze these results for the types of data problems identified by CDC. See Table 3.11.
Measurement: Percent of known duplicates correctly marked as either certain duplicates or potential duplicates, broken down by type of data problem.

Criterion: The product should not mark any records as duplicates that aren't actually duplicates - no false positives.
Testing Technique: Using a test data set, run the matching software and record what it determines are not duplicates. Break the results down by type of data problem. See Table 3.11.
Measurement: Percent of known non-duplicates correctly identified by the software as non-duplicates.

Area: Execution Speed

Criterion: The product should determine whether any given record is a duplicate, potential duplicate, or non-duplicate in a reasonable amount of time (less than several seconds).
Testing Technique: Using a test data set with known duplicates, run the matching software for at least 1,000 records against a database of at least 500,000 records. Record the time it takes to process the 1,000 records.
Measurement: Average clock time per record; CPU time per record; IO wait time per record.
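The accuracy measurements in Table 3.10 reduce to a simple tally once the software's output is compared against the known-duplicate truth data. This sketch assumes both are available as collections of record-ID pairs; it does not reproduce the CDC toolkit's actual file formats or APIs.

```python
from collections import defaultdict

def percent_found(truth, found):
    """truth: {(id1, id2): problem_type} for every known duplicate pair.
    found: set of (id1, id2) pairs the software flagged as certain or
    potential duplicates. Returns percent found per type and overall."""
    by_type = defaultdict(lambda: [0, 0])        # type -> [found, total]
    for pair, problem_type in truth.items():
        by_type[problem_type][1] += 1
        if pair in found:
            by_type[problem_type][0] += 1
    report = {t: 100.0 * f / n for t, (f, n) in by_type.items()}
    total_found = sum(f for f, _ in by_type.values())
    report["overall"] = 100.0 * total_found / len(truth)
    return report

# Tiny made-up truth set for illustration:
truth = {(1, 2): "First Name Spelling",
         (3, 4): "First Name Spelling",
         (5, 6): "Date of Birth Difference"}
found = {(1, 2), (5, 6)}
r = percent_found(truth, found)
print(r["First Name Spelling"], round(r["overall"], 1))  # 50.0 66.7
```

The same tally, run over the 251 known duplicates in the CDC test set, is what produces per-category results like those reported for PostalSoft in Table 3.14.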

Table 3.11 - Common Types of Data Problems Among Duplicate Records

Duplicate Problem Type | Description | Count
First Name Spelling | Nicknames, typos, or variations of first name. These can sometimes match by Soundex or partial matching. | 51
Last Name Spelling | Typos or misspellings of last name. These can sometimes match by Soundex or partial matching. | 24
First Name Hyphenation | Hyphenated first name has missing hyphen or is missing one part of the name. | 15
Last Name Hyphenation | Hyphenated last name has missing hyphen or is missing one part of the name. | 23
First Name Reversed w/Last Name | First name has been reversed with last name; for some names, not easy to distinguish. | 4
First Name Reversed w/Middle Name | First name has been reversed with middle name. | 4
Middle Name Reversed w/Last Name | Middle name has been reversed with last name. | 4
Different Last Name | Last name is totally different due to re-marriage, foster care, or other reasons. | 14
First Name as "Baby" | Child has been entered into the system, possibly with hospital data, prior to naming. | 9
Suffix included in First Name | Suffix erroneously included in first name field. | 7
Suffix included in Last Name | Suffix erroneously included in last name field. | 5
Date of Birth Difference | Date of birth for same person does not match due to error in day, month, year, or some combination of these. | 61
Gender Difference | Gender for same child does not match other record due to error. | 4
Duplicate Core Data (first, last, DOB, sex) | The first name, last name, date of birth, and gender fields are identical in both records, although other fields may not completely match. These cases are common and normally not considered a problem by registries. | 16
Exact Duplicate (all demographic fields) | Every demographic field is an identical duplicate. (This includes the child names, mother names, DOB, and gender. Some may even have identical vaccines, as would occur when an electronic submission is re-sent.) These cases are common and normally not considered a problem by registries. | 10
Total Duplicates | | 251

Note that these duplicate problem types and their meanings come from the reporting categories of CDC's de-duplication analysis tool. See User Manual for the De-duplication Toolkit, p. 7 [14].


Table 3.12 - Types of Similar, but Non-duplicate Records in the CDC Test Set

Type of Similarity | Description | Notes | Count
First Name Spelling | Two records have the same last names and same DOB, but first names are spelled differently and mothers are different. | Records could be confused as the same person with a first-name spelling problem. | 6
First Name Spelling + Diff DOB | Two records have the same last name, but first name and DOB have some differences (may still be similar but not an exact match). Mothers are different. | Records could be confused as the same person with first-name and DOB errors. | 8
Last Name Spelling | Two records have the same first names and same DOB, but last names are spelled differently and mothers are different. | Records could be confused as the same person with a last-name spelling problem. | 8
First Name Hyphenation | Two records have the same DOB and last name and have a first-name hyphenation difference. Mothers are different. | Records could be confused as the same person with a first-name hyphenation problem. | 2
Last Name Hyphenation | Two records have the same DOB and first name and have a last-name hyphenation difference. Mothers are different. | Records could be confused as the same person with a last-name hyphenation problem. | 2
First Name Reversed w/Last Name | Two records have the same DOB and have reversed first and last names from each other. Mothers are different. | Records could be confused as the same person with first and last name switched. | 2
First Name Reversed w/Middle Name | Two records have the same DOB and last name and have reversed first and middle names from each other. Mothers are different. | Records could be confused as the same person with first and middle name switched. | 2
First Name Reversed w/Middle Name and Diff DOB | Two records have different DOBs and reversed first and middle names (may still be similar but not an exact match). Mothers are different. | Records could be confused as a duplicate with reversed first/middle names and DOB errors. | 2
Middle Name Reversed w/Last Name | Two records have the same DOB and first name and have reversed middle and last names from each other. Mothers are different. | Records could be confused as the same person with middle and last name switched. | 2
Different Last Name | Two records have the same first names and same DOB, but they have different last names and mother data. | Records could be confused as the same person with a different last name. | 14
First Name as "Baby" | Two records have "baby" as part of the first name, the same last name, and the same date of birth. Other fields differ. | Compares to duplicate cases where "baby" was part of the first name. | 2
Date of Birth Difference | Two records are two people with the same or similar names and different date of birth and mother data. | Records could be confused as the same person with a date-of-birth error. | 12
Gender Difference | Two records have similar first names, the same last name, and the same DOB, but they have different gender and mother data. | Records could be confused as the same person with an error in the gender code. | 2
Duplicate Core Data (first, last, DOB, sex) | Two records have the same first and last names, date of birth, and gender, but different middle names and mother data. | Records could be confused as the same person. | 2
Multi-births | Two or more records represent twins or triplets. All fields match except first name, and maybe middle name. (Mothers are the same.) | Records could be confused as the same person with a first-name error. | 10
Siblings | Two records represent two brothers and/or sisters. All fields match except DOB, first, and middle names. (Mothers are the same.) | Records could be confused as the same person with first-name and DOB errors. | 4
Cousins | Two records have similarities in some fields - could be last names or mom's names. | Records could be confused as the same person with first-name, last-name, and DOB errors. | 4
Soundex Match | Two records have the same DOB. First and/or last names will match based on Soundex but don't really look that much alike (e.g., Morgan and Morrison). Mothers are different. | Records could be confused as the same person with first-name and/or last-name errors, if there is too much reliance on Soundex. | 6
Total non-duplicates that look like duplicates | | | 90

Note that these similarity types and their meanings come directly from the reporting categories of CDC's de-duplication analysis tool. See User Manual for the De-duplication Toolkit, p. 9 [14].


3.3.2 Step 2 - Setup and Learn the Product

After establishing the evaluation criteria and testing techniques, the next step is to install the product, configure or train it, integrate it into a simulated information system, and then become proficient in its use. Since setup is one of the evaluation areas with testing techniques, testers need to keep notes about the process and then make some subjective assessments about the product.

3.3.3 Step 3 - Measure the Product Against the Evaluation Criteria

The next step is to test against the evaluation criteria using the techniques specified with the criteria in each of the three areas.

3.3.3.1 Setup

Table 3.13 summarizes the results for the PostalSoft evaluation in the setup area. Note that the ratings are subjective and, because there were no other products to compare against, they offer little insight except to say that PostalSoft fell short of expectations, particularly with respect to the third criterion.

Table 3.13 - Evaluation Results for PostalSoft in the Setup Area

Criterion | Notes | Rating
The product should come with detailed instructions for installation and configuration. | The setup process had some problems. Technical support was called (several times?) to resolve those problems. | 2
If the product is stand-alone, it should provide sufficient documentation for setting up the importing and exporting of the data. If the product is a server-based system or SDK, it should provide sufficient documentation for integrating it into an existing information system. | Although the product came with reasonable documentation for doing the import, the process was not as smooth as expected. One call to a support line was necessary. | 3
The configuration or training should be saved so it doesn't have to be redone every time the product is used or updated. | Although it allows the user to save custom matching rules, it does not allow the user to save all of the necessary configuration parameters for importing and exporting. | 1

3.3.3.2 Accuracy

Since the PostalSoft product is a stand-alone system, the testing techniques for the criteria in the accuracy area require the tester to do the following:

1. Load the test data into PostalSoft.
2. Use the product to remove the duplicates. This involves selecting or creating a set of matching rules, running the match/merge tools, and reviewing the match groups (clustering).
3. Export the cleaned data from PostalSoft.
4. Analyze the cleaned data using CDC's results-analysis tool, which is part of the de-duplication toolkit [5].
5. Repeat steps 1 through 4 with a different set of matching rules.

Table 3.14 summarizes the best overall results.² Out of 251 true duplicates, PostalSoft found 239 (95%) without mistakenly matching any non-duplicates. Note that most of the missed duplicates came from differences in first names.

Table 3.14 - Summary of De-duplication Accuracy in PostalSoft

Duplicate Problem Type | True Duplicates | Duplicates Found | Duplicates Missed | Percent Found
First Name Spelling | 51 | 48 | 3 | 94.12%
Last Name Spelling | 24 | 24 | 0 | 100%
First Name Hyphenation | 15 | 12 | 3 | 80%
Last Name Hyphenation | 23 | 22 | 1 | 95.65%
First Name Reversed w/Last Name | 4 | 1 | 3 | 25%
First Name Reversed w/Middle Name | 4 | 4 | 0 | 100%
Middle Name Reversed w/Last Name | 4 | 3 | 1 | 75%
Different Last Name | 14 | 14 | 0 | 100%
First Name as "Baby" | 9 | 9 | 0 | 100%
Suffix included in First Name | 7 | 6 | 1 | 85.71%
Suffix included in Last Name | 5 | 5 | 0 | 100%
Date of Birth Difference | 61 | 61 | 0 | 100%
Gender Difference | 4 | 4 | 0 | 100%
Duplicate Core Data (first, last, DOB, sex) | 16 | 16 | 0 | 100%
Exact Duplicate (all demographic fields) | 10 | 10 | 0 | 100%
Total | 251 | 239 | 12 | 95.22%

In addition to the above results, the tester found that PostalSoft's matching algorithm was limited. It allows matching on eight fields, and only three of them can be user-defined. The rest have to come from a list of pre-defined fields that primarily include name and address. Its pre-defined list doesn't contain any field for birth date, mother information, or father information. So, for a matching algorithm to include any of these other pieces of data, it had to do so via one of the three custom fields. Also, the accuracy of PostalSoft's matching algorithm is completely dependent on how the user sets up the matching rules. Although it provides a number of pre-defined rule sets, most of them were oriented more toward address-book de-duplication, and none of them took advantage of birth date and parent information. Non-technical users may have a difficult time choosing a set of matching rules or setting up new ones.

² This result was achieved with a custom set of matching rules that considered the child's last name, first name, middle name, birth date, gender, mother's last name, and mother's first name. Significantly fewer duplicates were found with other rule sets.

3.3.3.3 Execution Speed

Although execution speed seemed like an important characteristic to test, the criterion was reasonable, and the testing technique was justified, this part of the evaluation turned out to be impractical. Given the budget and schedule constraints of this project, the project team was unable to construct a meaningful test data set of the required size (500,000 records). Creating such a test data set would be of great value and an important objective of a possible future project.

3.3.4 Step 4 – Compile, Interpret, and Document the Results
The last step is to compile, interpret, and document the results in a form that can be easily communicated and understood. Typically, a full de-duplication tool study for an integration project would test multiple products. The results for each of these evaluations would then provide a basis for making comparisons and drawing conclusions. Since this project included only one sample benchmark evaluation, examples of such comparisons and interpretations cannot be given here. However, in general, the key to this step is to make comparisons that are well formed (based on common measures, characteristics, etc.) and interpretations that are supported by the data, and then to write them up in a form that clearly and concisely communicates the results.

3.4 Discussion
Completing a meaningful tool evaluation for any kind of product is a non-trivial activity. However, there are some special challenges in evaluating de-duplication products for integrated health information systems. This section identifies and discusses four of these challenges.

3.4.1 Finding a common basis for comparison
The most notable challenge is finding a common basis for comparison. As discussed earlier, the technologies behind de-duplication solutions vary with respect to the parts of the problem they address, the types of data cleaning that they provide, the sophistication of the matching algorithms, and their support for record merging or linking. Also, the products offer different opportunities for integrating de-duplication into existing information systems (or not). Some are geared toward one-time or periodic use, as is the case for most stand-alone products. Others, like server-based products, provide a variety of connectivity options. Still others simply consist of software components that programmers can reuse in their own software. Finally, the products differ in terms of how they are packaged, licensed, and supported. So, finding a common basis for comparison is difficult and requires some serious forethought. To meet this challenge, it is important to

1. Stay focused on those issues that are important to a particular integration project
2. Establish a set of evaluation criteria that is meaningful with respect to those issues
3. Measure products against the criteria using pre-planned testing techniques or test cases

3.4.2 Obtaining evaluation software
The second challenge is obtaining evaluation software. The project team found that most of the vendors were not willing to let others test their products. However, several offered to run our test data through their systems themselves and then send us the results. There are several possible reasons that might explain the vendors' reluctance. First, they did not perceive this project as resulting in a potential sale and therefore may not have considered it worth the effort. Obviously, this issue would not exist if an individual project conducted an evaluation for the purpose of making a purchase. Second, many of the products are not easy to set up, configure, or learn. For an evaluation to be fair and represent a product in its best light, the vendor might have to invest considerable time and effort in assisting with the evaluation. This issue is likely more acute for server-based and SDK products than for stand-alone products, since their setup is typically considerably more involved.

3.4.3 Obtaining or creating meaningful test data
The third challenge is obtaining or creating meaningful test data. Although the CDC test data set is good, it is small and oriented towards immunization registries. It may not reflect the kinds and percentages of errors that occur in a given integrated child health information system. As a result, the evaluation could produce misleading results. For example, if an information system contains a high percentage of duplicates with first-name spelling problems but the test data has only a few, and the products don't do well with first-name matching, then the test results would be higher than they should be.

3.4.4 Interpretation of results
The fourth challenge is properly interpreting the results of de-duplication activities. Each product may use different terminology and report results in a slightly different fashion. This can easily cause the tester to misunderstand what is actually taking place. For example, to analyze the number of duplicates found by category, the CDC analysis tool relied on the de-duplication tool assigning the same "Patient Id" to all the records in a cluster of possible matches. PostalSoft, however, did not appear to do this directly. As a result, the tester made a serious error in preparing the data for the analyzer, which altered what the analyzer treated as matches and led to inaccurate accuracy statistics. Fortunately, the error was discovered, and as it turns out, PostalSoft does produce a matching group number that could serve the same purpose as the Patient Id. Although this was a case of human error, it does illustrate a potential problem for any de-duplication tool evaluation. Because of the inherent complexity and the wide range of products, it is easy to misinterpret the results for any particular accuracy test case. Testers must ensure that they fully understand the product and the meaning of the data.
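The match-group bookkeeping described above can be checked mechanically before results are fed to an analyzer. A hedged sketch follows; the field names `id` and `match_group` are assumptions for illustration, not PostalSoft's actual output format. It converts tool-assigned group numbers into duplicate clusters and compares them against known clusters.

```python
def clusters_from_groups(records, group_field="match_group"):
    """Group record ids by the tool-assigned match-group number."""
    clusters = {}
    for rec in records:
        clusters.setdefault(rec[group_field], set()).add(rec["id"])
    # Only groups with 2+ members represent detected duplicates.
    return {frozenset(ids) for ids in clusters.values() if len(ids) > 1}

def score(found_clusters, known_clusters):
    """Count known duplicate clusters the tool reproduced exactly."""
    found = sum(1 for c in known_clusters if c in found_clusters)
    missed = len(known_clusters) - found
    accuracy = 100.0 * found / len(known_clusters)
    return found, missed, accuracy

records = [
    {"id": 1, "match_group": "A"},
    {"id": 2, "match_group": "A"},   # duplicate of 1, correctly grouped
    {"id": 3, "match_group": "B"},
    {"id": 4, "match_group": "C"},   # duplicate of 3, missed by the tool
]
known = {frozenset({1, 2}), frozenset({3, 4})}
print(score(clusters_from_groups(records), known))  # → (1, 1, 50.0)
```

A check like this makes the tester's assumption about the group field explicit, so a mismatch surfaces as an obviously wrong score rather than silently distorted statistics.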


4. Review of De-duplication in Integrated Child-Health Information Systems in Eight Connections Projects
A primary goal of this project is to review existing integrated child-health information systems and report on what they are doing in terms of de-duplication so that others can learn from their experiences. Since the Connections group consists of public-health agencies that are all attempting to integrate child-health information systems and are willing to share their experiences, its members and their corresponding projects became the subjects of this review. Descriptions of these projects, based on Project Briefs dated December 2001, are available on the Connections website [11]. The review proceeded in four steps:

1. Initial survey instrument developed with participant input
2. Analysis of initial survey responses
3. Review of individual project scope and objectives
4. Summary of projects and their de-duplication issues

The initial survey instrument, shown in a condensed form in Figure 4.1, sought information about the scope of the integrated child-health information systems and their current status with respect to de-duplication. The survey was sent out to all Connections members via e-mail and discussed on several Connections conference calls. Eight groups in seven different jurisdictions responded with sufficient detail to proceed to the next steps. Tables 4.1 – 4.9 summarize the findings of the survey responses in the following areas.

1. Birth cohorts
2. Health-care programs involved in the integration
3. Use of a master or individual indices
4. Sources considered most authoritative for demographic information
5. Degree of automation for de-duplication activities
6. Front-end vs. back-end de-duplication
7. Data elements used in the matching process
8. Quality assurance procedures
9. Use of off-the-shelf software

It's important to note that the initial survey captured a static view of the integration projects and their use of de-duplication technology at a single point in time. This helped the project team do some rough analysis and organization of ideas prior to conducting individual project reviews. The individual project reviews focused on obtaining a more dynamic understanding of each individual project. Through one-on-one phone calls, e-mail messages, and exchanges of project documentation, the project team was able to probe de-duplication issues that were


unique to the individual projects and meaningful with respect to their current efforts. Some of the projects, such as Rhode Island's KIDSnet, are in the process of re-designing their systems, so the review looked at both current and future use of de-duplication technology. Other projects, like Utah's, are still waiting on the resolution of a few organizational issues before deploying their integrated health-care system, so the review focused on de-duplication in the individual participating programs, the introduction of a new Birth Record Number for three of those programs, and on what the integrated system will be able to provide in the future. The project team summarized the information gathered during the individual project reviews in eight project abstracts. These abstracts appear at the end of this section, in descending order based on the number of programs that they involve.


Figure 4.1a – Condensed version of the initial survey, page 1

Initial Survey on

Record Matching and De-duplication Technologies for Child Health Integrated Systems

This survey aims to gather basic information about de-duplication software and procedures currently being used in child health integrated systems. Please take a few minutes to answer the questions below in the context of your integrated system. Feel free to contact Stephen Clyde at 435-797-2307 or [email protected] if you have questions.

Name:
Organization:
Integration Project:

1) How large is your Birth Cohort? ________ (Births/Year)

2) Which health-care systems are currently involved in your integration project? (check all that apply)

[ ] Immunization  [ ] Newborn screening  [ ] Hearing screening  [ ] Lead screening
[ ] Vital Records  [ ] Early Intervention  [ ] Women, Infants, and Children (WIC)
[ ] Birth Defects Registry  [ ] Medicaid  [ ] Family Services  [ ] NEDDS  [ ] Other

3) Does your integrated system keep a separate child or person index that acts as a “master” or “authority” for the de-duplication process?

[ ] Yes (skip to question 5)  [ ] No

4) Which of the participating systems (those listed in question #2), if any, keep a child or person index that acts as an “authority” in the de-duplication process?

[ ] Immunization  [ ] Newborn screening  [ ] Hearing screening  [ ] Lead screening
[ ] Vital Records  [ ] Early Intervention  [ ] Women, Infants, and Children (WIC)
[ ] Birth Defects Registry  [ ] Medicaid  [ ] Family Services  [ ] NEDDS  [ ] Other  [ ] None

5) Among all the participating systems, which one is considered to have the most authoritative demographic information about persons?

[ ] Immunization  [ ] Newborn screening
[ ] Hearing screening  [ ] Lead screening


Figure 4.1b – Condensed version of the initial survey, page 2

6) Would you consider your de-duplication process to be fully automated, manual, or semi-automated?

[ ] Fully automated  [ ] Manual  [ ] Semi-automated

If semi-automated, how does the user interact with the software to do de-duplication?

7) Is your overall process a front-end or back-end or some combination approach? (check all that apply)

[ ] Front-end (record searching, matching, and merging occurs prior to entering new records into the database)
[ ] Back-end (record searching, matching, and merging occurs after entering new records into the database)
[ ] Other, please specify: ______________________________________

8) If your system uses a front-end approach, please answer the following:

a. What percentage of records matched existing records in the system as they are entered or imported? _____ % (estimate the percentage the best you can).

b. What percentage of entered or imported records did not match any existing record, but probably should have? _____ % (estimate the percentage the best you can).

c. Is your front-end record matching software based on probabilistic record matching? Typically, a probabilistic approach will return search results that list potential matches with some kind of score or ranking that indicates how likely each one is to be an actual match.

[ ] Yes  [ ] No  [ ] Don’t know

d. Is your front-end matching software based on machine-learning technology? You can assume that it is if someone had to “train” the system on known matches prior to being used for real.

[ ] Yes  [ ] No  [ ] Don’t know


Figure 4.1c – Condensed version of the initial survey, page 3

9) If your system uses back-end de-duplication, please answer the following:

a. What percentage of the records entered or imported into your integrated system are duplicates? ____ % (estimate the percentage the best you can).

Note: If your integrated system uses a central database, then this is the percentage of duplicates that get into that database. If your system involves multiple databases, then this is the percentage of unlinked or uncorrelated duplicates across all those participating databases.

b. Eventually, the back-end de-duplication process finds what percentage of the duplicate records in your integrated system? ____ % (estimate the percentage the best you can).

c. Is your back-end de-duplication software based on probabilistic record matching? Typically, a probabilistic approach will return search results that list potential matches with some kind of score or ranking that indicates how likely each one is to be an actual match.

[ ] Yes  [ ] No  [ ] Don’t know

d. Is your back-end de-duplication software based on machine-learning technology? You can assume that it is if someone had to “train” the system on known matches prior to being used for real.

[ ] Yes  [ ] No  [ ] Don’t know

10) On which data elements or combinations of data elements does your de-duplication software base its matching (e.g., child’s last name, birth date, mother’s last name, etc.)? List them in order of most important (or most heavily weighted) to least.

11) Did you test other data elements or combinations of data elements prior to arriving at the present configuration?

[ ] Yes  [ ] No

12) Briefly describe your quality assurance procedures or constraints for checking incoming information from clinics and hospitals, if any. For example, when adding or importing a child’s record into the system, does your system require certain data elements, like child’s last name and birth date, to be present? Or, do birth dates have to be complete, or can they be approximate (i.e., just the month and year)?

13) Does your integrated system use any off-the-shelf software for record matching or de-duplication?

[ ] No  [ ] Yes, please provide product names and vendors

14) Who are the best persons to contact for further technical details related to record de-duplication?


Table 4.1 - Integration Projects and Their Birth Cohorts

Project                                          Organization   Birth Cohort
KIDSNET                                          RI, DOH        13,500
FAMILYNET                                        OR, DOH        46,000
ALERT                                            OR, Imm.       47,000
Master Child Index (MCI)                         NYC, DOHMH     125,000
MOHSAIC                                          MO, DOHSS      75,000
Community Early Childhood Screening & Tracking   KS, DOH        6,500
ImmPact                                          ME, BOH        13,000
CHARM                                            UT, DOH        47,000

Table 4.2 - Health-care programs involved in the integration

Involved Systems

Project IR HS NBS LS VS EI WIC BDR MC FS NEDDS PRAM other KIDSNET Y Y Y Y Y Y Y Y Newborn

Development Risk Assessment; Home Visiting

FAMILYNET Y Y Y Y Y Y Perinatal & Child health programs

ALERT Y Y MCI Y Y Y Communicable

Disease Surveillance System in the future

MOHSAIC Y Y Y Y Community Early Childhood Screening & Tracking

Y Y ?

ImmPact Y Y Y Y CHARM Y Y Y

IR     Immunization Registry
HS     Hearing Screening
NBS    Newborn Screening
LS     Lead Screening
VS     Vital Statistics
EI     Early Intervention
WIC    Women, Infants, and Children
BDR    Birth Defects Registry
MC     Medicaid
FS     Family Services
NEDDS  National Electronic Disease Surveillance Systems
PRAM


Table 4.3 - Use of a master or individual indices

Person Indices or Authorities

Project                                          Uses Master Index   Notes
KIDSNET                                          Yes
FAMILYNET                                        Yes
ALERT                                            No                  Vital Records for DOB only
MCI                                              Yes
MOHSAIC                                          No
Community Early Childhood Screening & Tracking   No
ImmPact                                          No                  Active Medicaid data owner id.
CHARM                                            Yes

Table 4.4 - Sources considered most authoritative for demographic information

Demographic Authority Project IR HS NBS LS VS EI WIC BDR MC FS NEDDS PRAM Other KIDSNET Newborn Developmental Risk

Assessment, which is being integrated with Vital Records

FAMILYNET Y VS and NBS also maintain their own data systems and provide data for the integrated system. Between those two, VS is considered as the most authoritative.

ALERT VS is the most authoritative source for dates of birth. VS addresses are generally not good, as many families move after the birth of a child. ALERT gets demographics from several sources for each child.

MCI Y

MOHSAIC Y

Community Early Childhood Screening & Tracking

ImmPact Active Medicaid

CHARM Y IR is considered the most authoritative for contact information


Table 4.5 - Degree of automation for de-duplication activities

Automation

Project   Full   Manual   Semi   Notes

KIDSNET   Y      Y        Y      The matching process is fully automated on the front end and manual on the back end. Text files are imported and matched using established algorithms. Any incoming records that do not match undergo human review and are adjusted for spelling or DOB so that they match and get imported. The merge process is manual, though a software development project is underway that would allow semi-automated merging. The user would need to determine which data elements he or she wants to keep between the two records, and then the system would automatically merge the data elements into a single child record.

FAMILYNET Y The current back-end process for WIC and public health immunization records produces matched records that users manually examine and de-duplicate. The current front-end linking of newborn birth certificate, heel-stick, and hearing screening data requires clerical review of records that do not meet the match-weight criteria for automated linking. We are developing a more sophisticated system that will combine these approaches.

ALERT Y Custom designed de-duplication software (called RESOLVE) checks records at the time of import. It identifies children who cannot be matched up with an existing unique ID and whose name and date of birth indicate a possible match with an existing child in ALERT. Children with existing records automatically get updates to their existing shot records without manual review if the unique ID is the same. Records specialists use RESOLVE to examine records and merge children with same/similar last names and/or dates of birth. Specialists validate discrepancies in DOBs with birth records. Using strict matching rules, they examine the records and manually deduplicate the children and immunizations. All permanent record merging happens in RESOLVE. ALERT also uses RESOLVE to store "matched lists" of names from Soundex, etc. These files speed the human review time for permanently merging records in RESOLVE.

MCI Y De-duplication clerks review pairs of records that the Choicemaker program cannot determine to be a definite match, but have a high score for potentially being the same.

MOHSAIC Y Working towards automation, using various tools to cluster candidate listings; manual review and/or business rules determine records to be merged or deleted. The merge process itself is semi-automated. The matching process identifies the duplicates; the process of removing duplicates itself can be incredibly complex. For example, the MOHSAIC application uses the person's or organization's ID as a foreign key in dozens of tables, and each software release may add additional tables. We have to maintain a special application just to track down which tables have the client's keys in them, and then determine how to merge the data. We have not yet developed an application for provider keys that are duplicated. Organizations are more difficult to match, and undoing an incorrect merge is much more difficult than fixing a client's record.

Community Early Childhood Screening & Tracking

Y De-duplication is embedded in the clinic support system which includes all public health clinic activities at the given health department. In other words, we are working with all the clients that have presented themselves at the given health department in their various clinic offerings.

ImmPact Y Uses a screen to display pairs of records that meet criteria indicating potential matches. The user decides to merge or disassociate each pair of records. Medicaid patient records are automatically merged with existing patients where there is an exact match on first name, last name, middle initial, date of birth and SSN.

CHARM Y A staff member resolves matches that are close, but not exactly the same.


Table 4.6 - Front-end and back-end de-duplication

                                                 Front-end                                Back-end
Project                                          Match   False Neg.   Prob.   Learning    Duplicates   Match   Prob.   Learning
KIDSNET                                          50%     30%          No      No          1%           90%     No      No
FAMILYNET                                        10%     5%           Y       No          10           93%     No      No
ALERT                                            84%     14%          No      No          14           96%     No      No
MCI                                              ?       ?            Y       Y           ?            ?       Yes     Yes
MOHSAIC                                          90%     7%           No      No          7            3%      No      No
Community Early Childhood Screening & Tracking                        Y       No                               Yes     No
ImmPact                                          65%     20%          Y       No          50           40%     Yes     No
CHARM                                                                 Y                                        Yes

*Note that all the projects reported that they support some degree of front-end and back-end de-duplication.


Table 4.7 - Data elements used in the matching process

Data Elements

Project Matching Elements Tested Others

KIDSNET For matching incoming records, taking the immunization process as an example, primary matching is done on provider's medical record number, child's first name, child's last name, and date of birth (all equally weighted). Secondary matching (if no primary match) is done on child's first name, child's last name, and date of birth (all equally weighted). A third match (if no secondary match) is done on child first-name alias, child last-name alias, child date of birth, parent first name, and parent last name (all equally weighted). Many of the other match processes (lead screening, WIC, EI) are similar. For de-duplication, most potential duplicates are found as a by-product of other processes (immunization error resolution, matching prenatal visits with postnatal information for the same child). Alternatively, we use a query based on matching Soundex of child's first and last names and an exact match on date of birth.

No

FAMILYNET Data elements used by the current front-end, probabilistic matching system (97% match): metabolic ID; child's birth date; child's gender; birthing facility; child's last name; mother's last name. Data elements to be tested for the new back-end, probabilistic matching system: child's birth date; child's gender; child's first name; mother's first name; primary address; primary telephone number; child's last name; mother's last name.

Y

ALERT Data elements include last name, date of birth, and a unique ID. Specialists can also use immunization histories. There must be a match of at least two shot series from different days, in addition to an exact match on date of birth, an exact or partial match on name, and a match on one other identifying element such as address, Medicaid #, etc.

Y

MCI Data elements that indicate a child is a twin are weighted the most heavily. Another important clue is a name swap where the first and last names are switched. Bin number, which is a unique identifier of the address, is also very important. Other clues of importance are first and last name (first name being more heavily weighed), child's date of birth and mother's maiden name and date of birth.

Y

MOHSAIC Recent expedient trials have led to the following key arrangements, ordered with the most aggressive first. 1) Child's first name, child's last name, child's date of birth, child's gender. This arrangement has a low risk of false positives and typically yields an aggressive match rate of 80 - 90%. 2) Child's date of birth, birth order, mother's SSN. When mother's SSN is prevalent, this arrangement is very aggressive. 3) Mother's first name, mother's last name, child's date of birth, child's gender, birth order. This arrangement is like #1, except that combining mother's names with birth order is a good indicator when the child's name is less prevalent in the data. 4) Birthing facility, child's medical record number, child's date of birth. This arrangement is moderately aggressive in current data but has great potential when facility and child's MRN are prevalent and accurate. 5) Birthing facility, mother's MRN, child's date of birth, and birth order. Comparable to #4 above. 6) Address (scrubbed to significant numbers and name), child's date of birth, child's time of birth, and birth order. This one has slight to moderate strength. It has the possibility of a small number of false positives; none have been manifest in trial data. In addition to the above data elements, telephone number has demonstrated some value in clustering household and organization data.

Y

Community Early Childhood Screening & Tracking

ImmPact Data elements include Medicaid ID, first and last name, SSN, and DOB.

Y

CHARM Data elements include a birth record number, child's last name, first name, date of birth, birth city, gender, multiple-birth information, and birth weight. The data elements and rules that drive the matching are very flexible. More experimentation is currently in progress that will lead to further refinement of the rules.

Y


Table 4.8 - Quality assurance procedures

Project   Quality Assurance Procedures

KIDSNET   There are several ways a new child record can be added to KIDSNET. a) Manual Entry - In order to manually enter a new child, the following information is required: child first name, child last name, child DOB, child sex, parent first name, parent last name, parent DOB, relationship to child, street address, city, state, and ZIP code. b) Automatic Insert - In order to automatically insert a new child, the same information is required as for Manual Entry. c) Semi-automated Adds - Currently developing a semi-automated "add" process to be used chiefly to insert kids who were born out of state and therefore do not match a child from Rhode Island newborn information. The criteria have not yet been established.

FAMILYNET Required fields are first and last name and birth date. Fields listed as required, but that have default or 'don't know' response categories are: address, telephone number, race, ethnicity; language written, language spoken. Guardian's name is not required, but is usually present. The WIC module of the system allows linking of parent and child via an ID number. The module in development will allow linking of all family members and multiple family configurations. De-duplication may become easier (or harder) with this linking capacity.

ALERT There are checks for valid fields and immunization records that do not match previous imports. Partial dates of birth or immunization dates are not allowed. Methods include: range checking, frequency distribution for each essential data element, values within each data element to identify upper and lower values, identify statistical outliers, pattern checking of dates, check for inconsistency with Vital Records dates of birth, identify unknown, invalid, and unlikely combinations of values between variables, develop data quality audit summary report that includes: data item, number of invalid values, percent of records with invalid values, number of invalid values.

MCI We require that a record contain a first name, a last name, a sequence number, and a record type.

MOHSAIC First name, last name, and complete date of birth required. External data loads require a Department Client Number, which may be imputed from matching a client's name and date of birth to VS data.

Community Early Childhood Screening & Tracking

All the data is captured in the same system environment. Our problem is more related to persons using multiple names, the clerks not adequately researching the client register before creating a new client master, and confusion on Hispanic names.

ImmPact Requires the patient's first and last name, date of birth, contact first and last name, and address (street, city, state, ZIP). Birth dates must be complete, valid, and in the prescribed format.

CHARM Currently, first name, last name, and date of birth must be present. The date of birth must be complete. Generic names, like "baby", are rejected. Provider ID must match an existing provider. SSNs are edited. Gender is edited. Address is edited. Many demographic items are edited against valid codes.


Table 4.9 - Use of off-the-shelf software

Off-the-Shelf Software

Project                                          Use?   Product/Vendor
KIDSNET                                          No
FAMILYNET                                        Yes    AutoMatch (original version sold by AutoMatch; no longer available)
ALERT                                            No
MCI                                              No
MOHSAIC                                          Yes    dfPower Studio 5.0 + Customization Powerpack / DataFlux
Community Early Childhood Screening & Tracking
ImmPact                                          Yes    Name Search / Intelligent Search Technology Ltd.
CHARM                                            No


4.1 Rhode Island

Project Name: KIDSNET [1, 11]
Responsible Organization: Rhode Island Department of Health
Geographic Area: State of Rhode Island
Annual Birth Cohort: ~12,500

Project Overview: KIDSNET is designed to integrate data from the following databases and/or programs: Universal Newborn Screening for Developmental Risk, Immunization, Lead Screening, WIC, Newborn Screening (Heel-stick), Newborn Hearing Screening, Early Intervention, Home Visiting and Risk Response, and Vital Records. Immunization, Home Visiting and Risk Response, and Universal Newborn Screening for developmental risk are currently integrated into a single database, KIDSNET. This acts as a data warehouse by storing limited information from the Lead, WIC, Newborn Hearing Screening, Early Intervention, and Vital Records databases. Data from the Newborn Screening (Heel-stick) program is not yet integrated but will be data warehoused.

Key Organizational and Staffing Issues:

• All participating programs are in the same Division of the Health Department except Vital Records.
• No competing levels of government.
• State DOH has authority and control.
• Integration was an initial project goal even though funding was originally for just the immunization system.
• KIDSnet staff overall is 9.8 FTE, with 3 FTE data managers for error resolution and data management and 2 FTE for data/clerical support.

Community of Practice: Member of AKC 1, 2, and 3; Genetics Planning and Data Integration grant (HRSA) – AKC Project, Best Practices Source Book; HRSA/MCHB-DUE grant; CDC EHDI grant.

De-duplication: KIDSnet supports both front-end and back-end de-duplication. On the front end, data entry is done from paper forms with bar-coded IDs. It is labor intensive but seems to produce fewer errors than the back-end de-duplication. The matching process is fully automated on the front end and manual on the back end. The merging process on the back end is also manual. The user determines which data elements he or she wants to keep between the two records, and then the system merges the data elements into a single child record. KIDSnet uses a pessimistic matching algorithm that is based on conservative matching criteria and yields less chance of duplicate errors but more manual matching. Records that do not match a child in KIDSNET go on hold.


The Rhode Island registry uses a custom-developed program written in PL/SQL to generate a report of potential duplicates. Potential duplicates are identified using a Soundex comparison of last name, date of birth, and gender. Once the report is generated, human review and assessment is required to determine which records need to be merged. Records are merged manually, one at a time.
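The report logic can be sketched outside the database. The following Python is a generic illustration, not Rhode Island's actual PL/SQL code; it pairs records whose last names share a Soundex code and whose date of birth and gender match exactly (the record field names are assumptions).

```python
def soundex(name):
    """Standard 4-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {c: d for d, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}
    name = name.upper()
    out, prev = name[0], codes.get(name[0])
    for c in name[1:]:
        d = codes.get(c)
        if d and d != prev:
            out += str(d)
        if c not in "HW":          # H and W do not separate equal codes
            prev = d
    return (out + "000")[:4]

def potential_duplicates(records):
    """Group record ids that agree on last-name Soundex, DOB, and gender."""
    buckets = {}
    for rec in records:
        key = (soundex(rec["last"]), rec["dob"], rec["sex"])
        buckets.setdefault(key, []).append(rec["id"])
    return [ids for ids in buckets.values() if len(ids) > 1]

records = [
    {"id": 1, "last": "Smith", "dob": "2001-01-01", "sex": "F"},
    {"id": 2, "last": "Smyth", "dob": "2001-01-01", "sex": "F"},
    {"id": 3, "last": "Jones", "dob": "2001-01-01", "sex": "F"},
]
print(potential_duplicates(records))  # → [[1, 2]]
```

As in KIDSNET, output like this is only a candidate list; human review still decides which pairs are true duplicates.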

KIDSNET is currently integrating and consolidating the initial data collection process for Vital Records, Newborn Hearing Screening and Universal Newborn Screening for developmental risk through a new Vital Records data system. This will allow the three programs to utilize a single identifier that will minimize data matching and redundant data entry.

Rhode Island is also in the process of developing and implementing a probabilistic matching algorithm that will be applied to the records on hold. The registry is also creating an automated import process to add records identified as "not already in the database". Finally, an on-line automated merge tool is being developed to resolve and merge actual duplicate data into one record.

Key Issues:
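Probabilistic matching of this kind typically sums per-field agreement weights into a score and compares it against decision thresholds. A minimal sketch follows; the weights, thresholds, and field names are illustrative assumptions, not KIDSNET's actual parameters.

```python
# Illustrative log-odds-style weights: reward agreement, and penalize
# disagreement more heavily for discriminating fields like DOB.
WEIGHTS = {
    "last":  (4.0, -2.0),   # (agree, disagree)
    "first": (3.0, -1.5),
    "dob":   (6.0, -4.0),
    "sex":   (1.0, -2.0),
}
MATCH, REVIEW = 10.0, 4.0   # assumed decision thresholds

def score(a, b):
    """Sum per-field agreement/disagreement weights for a record pair."""
    total = 0.0
    for field, (agree, disagree) in WEIGHTS.items():
        total += agree if a.get(field) == b.get(field) else disagree
    return total

def classify(a, b):
    s = score(a, b)
    if s >= MATCH:
        return "match"            # merge automatically
    if s >= REVIEW:
        return "possible match"   # queue for human review
    return "non-match"            # add as a new record

a = {"last": "SMITH", "first": "ANNA", "dob": "2001-03-04", "sex": "F"}
b = {"last": "SMITH", "first": "ANA",  "dob": "2001-03-04", "sex": "F"}
print(classify(a, b))  # → possible match (score 4 - 1.5 + 6 + 1 = 9.5)
```

In a production system the weights would be estimated from labeled match/non-match data (for example, following the Fellegi-Sunter model) rather than set by hand, and the middle band would feed the hold-file review process described above.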

• Master record concept via Vital Records.
• Records in the hold file may be excessively delayed before being added to the file.
• Merging requires a significant amount of human review.
• The de-duplication processes are being re-engineered as part of a systems upgrade.


4.2 Oregon
Project Name: FamilyNet Data System [11]
Responsible Organization: Department of Human Services, Health Services, Office of Family Health (OFH)
Geographic Area: State of Oregon
Annual Birth Cohort: ~45,200
Project Overview: FamilyNet has been in development since the mid-1990s. In 2000, the Oregon Department of Human Services (DHS) began developing the public-sector module of FamilyNet as a public-sector health information system for local agency use, to integrate and coordinate health assessment and service information about children and families. FamilyNet will help public and private providers coordinate services to children and families and monitor risks, conditions, services, and outcomes over time. It will support coordination of services and evaluation of the service delivery system while assuring individual and family confidentiality and data security. The Oregon Children's Plan, a 2001 legislative mandate, expands the data system beyond the FamilyNet health services model. The rationale behind FamilyNet is to create a single, cumulative record for each client by tying together module-level records. FamilyNet goals include: avoiding redundant data entry by collecting data shared among programs only once; providing timely access to data for both state and local health departments; increasing accountability for state and federal program conditions, including program and fiscal assurances; and reducing fragmentation of data and health care services available to the public by providing a method to coordinate services among health and social service programs. The hub is a Client Master that contains demographics and contact information (addresses, family links, telephone numbers, guardian's name).
Key Organizational and Staffing Issues:

• All participating programs are in the same Division of the Health Department.
• Broad legislative mandate for program integration, but with some limitations on sharing information outside of public health departments.
• Extensive strategic planning, requirements definition and risk analysis.
• Development and implementation of a communications plan.

Community of Practice: Genetics Planning and Data Integration grant (HRSA); AKC Project – Best Practices Source Book; Turning Point.
De-duplication: "The purpose of this document is to define the requirements for electronic matching and merging of records imported from external systems and the identification of duplicate Participants and related data. These requirements were determined based on JAD sessions with participants including DHS program and technical staff" [6].


The diverse sources of data received by OFH create a significant challenge in minimizing the number of duplicate participants in its databases. FamilyNet, with its integrated Client Master database, is intended to help reduce the number of duplicates. OFH has put forth the following business goals with respect to FamilyNet data.

1. Maximize Data Quality.
   a. No more than 5–7% duplicate participants in Client Master.
   b. Maintain copies of useful old demographic data (e.g., address, phone number).
   c. Additional human resources may be required even after automated best practices are implemented.

2. Define clear criteria for identifying duplicate data at the Client Master and module level.
   a. Client Master is first priority.
   b. Module data is second priority.

3. Create an import utility to be used for all Client Master data imports. Each FamilyNet module will be responsible for the process to import, de-duplicate and merge module data.

4. Create a batch process to review Client Master data, identify duplicates and merge duplicate records based on specific match/merge rules.

5. Create an efficient on-line process for reviewing and resolving instances of duplicate (or suspected duplicate) Client Master data.

6. Wherever possible, create reusable modules to support goals 3, 4 and 5.
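A batch match/merge pass of the kind goal 4 describes might look like the following sketch; the match key and the fill-the-gaps merge rule are assumptions for illustration, not OFH's actual rules:

```python
# Sketch of a batch de-duplication pass over Client Master rows: group rows
# on a simple exact key and merge each group into one surviving record.

def merge_pair(survivor, dup):
    """Illustrative merge rule: keep survivor's non-empty values, fill gaps from dup."""
    for field, value in dup.items():
        if not survivor.get(field):
            survivor[field] = value
    return survivor

def batch_dedupe(rows):
    """One pass: exact match on (last, first, dob) triggers an automatic merge."""
    seen = {}
    for row in rows:
        key = (row["last"].upper(), row["first"].upper(), row["dob"])
        if key in seen:
            merge_pair(seen[key], row)
        else:
            seen[key] = dict(row)
    return list(seen.values())

rows = [
    {"last": "Lee", "first": "Ana", "dob": "2001-07-01", "phone": ""},
    {"last": "LEE", "first": "ANA", "dob": "2001-07-01", "phone": "555-0100"},
]
out = batch_dedupe(rows)
assert len(out) == 1 and out[0]["phone"] == "555-0100"
```

A real implementation would add the review queue of goal 5 for pairs that match only approximately, rather than merging on exact keys alone.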

Key Issues:

• Clear specifications.
• Master record concept.
• Decentralization of de-duplication processes to programmatic units.


4.3 Oregon Immunization ALERT
Project Name: ALERT [2]
Responsible Organization: Department of Human Services, Health Services, Office of Family Health (OFH)
Geographic Area: State of Oregon
Annual Birth Cohort: ~45,200
Project Overview: ALERT is part of a long-term strategy to improve immunization coverage for Oregon's children. Authorized users include health care providers (both private and public), health plans, schools, hospitals, and parents. ALERT began in the early 1990s and was created by a public-private partnership, Oregon Health Systems in Collaboration (OHSIC), which funded ALERT in 1996 as its first collaborative project. This approach was used because most immunizations in Oregon are given in the private sector. The primary focus is on immunization records of pre-school children, although ALERT has a growing volume of records for school-age children [2].
Key Organizational and Staffing Issues:

• Public–private partnership model leverages non-governmental resources and provides more flexible administration.
• ALERT exchanges data with FamilyNet, but full participation is limited by the scope of the enabling legislation, which would have to be changed to allow physicians participating in ALERT to view other FamilyNet data.

Community of Practice: AKC 1

De-duplication: The Oregon registry uses customized software called Resolve on a daily basis to identify demographic and immunization records with matching names, dates of birth and other key identifiers [2]. This process will be further automated using the "Auto Resolve" feature of the product. The next phase of enhancing the de-duplication process will re-direct staff effort toward resolving the more difficult, less obvious duplicate records. To accomplish this, pattern recognition software will be used to apply new matching criteria and to increase the percentage of accurate, standardized and probabilistic matches of registry demographic data. The Oregon registry currently has over 700,000 demographic records and over 5 million immunization records.
Key Issues:
• The ALERT immunization registry participates in FamilyNet by receiving immunization data from the public-sector Immunization module. Other records come mainly from sources external to public health: physicians and health plans. FamilyNet and ALERT are not yet linked.
• Use of a customized product solution.
• Moving to pattern recognition and probabilistic matches.
• Data quality issues have a higher profile.
• Focus on data use promotes quality.


4.4 New York City
Project Name: CIR-LeadQuest Integration Project [11]
Responsible Organization: New York City Department of Health and Mental Hygiene (DOHMH)
Geographic Area: New York City
Annual Birth Cohort: ~125,000
Project Overview: A directive to integrate the Citywide Immunization Registry (CIR) and Lead Poisoning Prevention Program (LeadQuest) databases was issued by the NYC Commissioner of Health, who recognized the project's potential to leverage the resources of the participating programs by creating a single system for provider outreach, case management and data analysis. CIR and LeadQuest (LQ) target the same population, primarily children 0-7 years of age. An integrated database would relate data for the same children across systems and help identify children at high risk of under-immunization and lead poisoning. The integrated system will provide immunization and lead status on-line, improve data quality across both systems by consolidating records, and create a centralized de-duplication service to be used by different units within DOHMH. The linkage involves the creation of a Master Child Index (MCI) and a data warehouse to identify and track the health of NYC children. Birth data from vital statistics maintained by the NYC DOHMH will be incorporated to populate the system. The CIR-LQ integration is the first part of a larger, phased initiative to create a comprehensive citywide child health registry in the DOHMH.

The two databases will not be combined, but will be integrated through a Master Child Index (MCI). All children found in the two systems (and subsequent systems to be integrated) will be "registered" in the MCI to facilitate matching children across these systems. The MCI will use sophisticated business rules to match new information to children in the MCI, de-duplicate children with multiple records or duplicate information within a record, and merge children across the databases. The MCI will contain identifying and demographic information for every child contained in at least one of the participating systems. Each participating system will interact with the MCI to: (1) add data to an existing MCI record or register a new child in the MCI and load its data; (2) use the MCI's services (a record de-duplication service); (3) access MCI demographic information; (4) identify whether there is information in another participating system that is available for display to a user via the MCI connection; and (5) transport the requested information from one system to the other in real time and present it through existing applications (or make it available to existing batch processes). These capabilities will be provided through a set of standard "services" available on the network to authenticated, eligible systems.

Key Organizational and Staffing Issues:

• No previous program integration between Lead and Immunizations; a merger of two cultures.


• Reorganization at start of project moved the data integration activities into a new office of health surveillance.

• Enforced multi-vendor contract responsibility reinforced differences in program and technology cultures and complicated development and deployment.

• Source programs are now in different organizational divisions of NYC Department of Health and Mental Hygiene following change of mayoral administration and reorganization in 2002.

• Changes in project leadership staff.
• NYC is responsible for its own Vital Records, separate from the state.

Community of Practice: CIR was a participant in AKC 1, 2, and 3. The CIR Project Director is President of AIRA.

De-duplication: NYC uses a commercial product, now called Choicemaker, for its de-duplication. The NY Citywide Immunization Registry (CIR) has about a 30% de-duplication rate. Given the size of the CIR (2.3 million records and over 14 million immunizations) and the large volume of monthly submissions, an automated solution to this problem is vital. CIR has adopted CMT (now called Choicemaker), which uses a new technique from statistical artificial intelligence. Choicemaker assigns a probability score to candidate record pairs based on a number of features that 'fire' depending on whether they are the same or different in each record pair. The weight of each feature is acquired during a 'learning' process in which Choicemaker is trained on a set of record pairs tagged by Registry staff. Each feature's weight depends on how well it persuades the human scorer that two records do or do not belong to the same child. For example, the zip-code-same feature received a lower weight than the telephone-same feature. The overall probability for each record pair is based on the number and weight of the features arguing for or against a merge. Record pairs with a high probability are automatically merged. Currently Choicemaker's features include exact and Soundex matches on first and last names, date of birth, gender, street number, Medicaid and medical record numbers, zip code, mother's maiden name and mother's date of birth. Choicemaker removes 96% of record pairs from human review with more than 99% accuracy. It has already successfully de-duplicated the 1997 and 1998 birth cohorts. The success of Choicemaker rests heavily on its being trained by knowledgeable users.
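The weighted-feature scoring described above can be sketched as follows; the feature names, weights and thresholds here are invented for illustration and are not Choicemaker's actual values:

```python
import math

# Sketch: each feature 'fires' for or against a merge; its learned weight is
# added when the fields agree and subtracted when they differ. The summed
# evidence is squashed into a probability, then thresholded three ways.
WEIGHTS = {
    "dob_same": 2.5,
    "phone_same": 2.0,
    "zip_same": 0.5,   # zip agreement persuades less than phone agreement
    "last_name_soundex_same": 1.5,
}

def score(pair_features):
    """Sum signed feature weights and map to a probability via the logistic."""
    total = sum(w if pair_features[f] else -w for f, w in WEIGHTS.items())
    return 1 / (1 + math.exp(-total))

def decide(p, merge_at=0.95, review_at=0.2):
    """High probability: auto-merge; middle band: human review; low: no merge."""
    if p >= merge_at:
        return "auto-merge"
    if p >= review_at:
        return "human review"
    return "no merge"

strong = {"dob_same": True, "phone_same": True,
          "zip_same": True, "last_name_soundex_same": True}
weak = {"dob_same": False, "phone_same": False,
        "zip_same": True, "last_name_soundex_same": False}

assert decide(score(strong)) == "auto-merge"
assert decide(score(weak)) == "no merge"
```

The middle band is what remains for human review; widening or narrowing it trades review workload against the risk of automated errors.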

CIR and LQ data quality is expected to improve with the MCI:
1. Fewer fragmented/duplicate records in CIR and LQ due to front-end Choicemaker software.
2. Values in all demographic fields will be stacked:
   • No time will be spent deciding which value is "right"; stacked values will be used to review incoming values.
   • Easy and successful identification of children by DOHMH staff and by providers and parents.
   • Providers will be more motivated to report accurately.
3. Better ability to identify the true MCI/CIR/LQ population (denominator): improved denominator data for LPPP because of the inclusion of Vital Records, and improved denominator data for LPPP and CIR because of the inclusion of children not born in NYC who have immunizations or lead tests but not both.
4. Enhancement of LPPP data because of access to race and Medicaid status (particularly for non-cases).
5. Verification of LPPP data via access to Vital Records data.
Key Issues:

• NYC has the most productized de-duplication software.
• The research orientation of the initial software developer and NYC staff contributes to its sophistication and functionality.
• Expense, level of effort and resources are consistent with the size of the NYC database (2.3 million children, 14 million immunizations) but may not be sustainable or replicable for projects of lesser scale and staff.
• The product vendor has codified much of the learning developed by NYC into a feature called Clue Maker, which it intends to make available for future sales. The efficacy of reusing this learning in a new live production environment has not been tested.
• Delays in scheduled deployment were exacerbated by the multi-vendor environment.
• The strategy of moving the de-duplication engine into the MCI is not yet proven in operation, as the MCI is still not running in production.


4.5 Missouri
Project Name: Missouri Health Strategic Architectures and Information Cooperative (MOHSAIC)
Responsible Organization: Center for Health Information Management and Evaluation (CHIME), Missouri Department of Health
Geographic Area: State of Missouri
Annual Birth Cohort: ~75,000
Project Overview: The Missouri Department of Health and Senior Services (MODHSS) is developing an integrated public health information system to support all programs and systems that relate to surveillance and/or client services (both health care and regulated clients). Common functionality has been identified and grouped together, and the application has been developed to support common functions: registration, scheduling, inventory, disease reporting, etc. All data are being integrated in an Oracle database, with each user able to view data based on his/her function and security level. The data are organized around a specific client and his/her relationship to other providers and services. To date, the following components have been integrated:

Surveillance Area: Communicable and Vaccine-Preventable Disease and other reportable conditions.
Client Health Management Area: Client Registration, Scheduling and Household Management; Inventory Management; Immunizations; TB Skin Testing; Family Planning; Family Care Safety Registry.
Regulated Client Area: Regulated functions for the Bureau of Narcotics and Dangerous Drugs; Lead Abatement Inspector Registration.

Components currently in a phase of analysis, design or development include: Surveillance Area (Reporting of STD/HIV cases; Elevated Blood Lead Levels; Electronic reporting of laboratory results); Client Health Management Area (Service Coordination for Special Health Care Needs and other children; Inquiries and Complaint Tracking; Resource and Referral Services; Blood Lead Level Screenings; Newborn Metabolic and Hearing Screenings and Case Management; Newborn Home Visitation; WIC Registration); and Regulated Client Area (Child Care Licensing). In addition, MOHSAIC staff is completing the necessary infrastructure applications for quality assurance and security activities. This approach resulted from a comprehensive assessment of MODHSS's organizational strengths and weaknesses that revealed weaknesses in overall strategic use of communications technology. It became clear to the department director that an integrated system was needed to reach Year 2000 goals. Other key factors were the cost and difficulty of maintaining over 60 program-specific computer systems serving individual health


programs. The systems ran on a variety of platforms since there were no hardware or software standards. In the mid-1990s, the National Immunization Survey ranked Missouri 49th in the nation for two-year-olds who were adequately immunized. Governor Mel Carnahan and legislators agreed to address this issue with a statewide immunization registry. General Revenue funds were appropriated to create the registry and provide access to all local public health agencies. The resulting infrastructure, together with Immunizations and TB skin testing, formed the first components of the MOHSAIC integrated system. Subsequent programs have built on this initial system.
Key Organizational Issues:

• Gubernatorial and legislative action to develop the immunization registry.
• Strategic planning and the resultant architectural design allow MO to use diverse funding streams to develop pieces of the system even if not in the most desirable order.
• Gubernatorial and legislative support for improving immunization coverage.
• Immunization registry was the driver and initial building block of the system.
• Strong information technology support.
• Strong and influential leaders in national IT initiatives.

Community of Practice: MO was an early INPHO state, INPHO 2 and 3; Genetics Planning and Data Integration grant (HRSA); AKC Project – Best Practices Source Book; Turning Point (Local Public Health).
De-duplication: Integrated database: Health Management is client-centered, Surveillance is case-centered. The surveillance system can look into the health management system if necessary. A registration/demographics/scheduling/vaccine-inventory core was created first, which includes immunization for all ages and TB skin tests. The DCN (Client ID) is being used by Social Services, WIC and Medicaid; it was added to the Birth Certificate and other systems retrospectively. Plan enrollment information has been integrated into the central database. Integration with the laboratory is not quite there yet, and other pieces are in different states of development or deployment. Merged data is transformed and brought into MOHSAIC; linked data is only viewed through MOHSAIC. Even with this tight integration, some systems have to remain separate, either because they are purchased or because it makes sense for them to be separate. It took time to understand that some systems need to be linked or merged instead of integrated. Local health departments at only two sites in MO are not using the central Immunization Registry; they have access to see immunizations, and MO is working to absorb their records from other systems. Private provider systems have had data absorbed (Medicaid and large health systems have had data abstracted and absorbed).
Basic De-duplication Process Steps Overview:

• Determine candidate duplicates.
• Verify duplication.
• Determine what data is to be deleted and what is to be consolidated.
• Reconcile the duplicate information.


De-duplication and Data Quality Issues:
• How to correct match errors.
• How to undo the reconciliation if an error is made (twins come to mind as a frequent culprit of this situation).
• Model and table structures to support the de-duplication process.
• Which data is the most current.

Master Index Concept: Incoming data is first compared to Vital Statistics (VS) data, so VS acts as an authoritative source for matching. Some secondary matching is done using the Department of Social Services ID. When children are added to MOHSAIC, they are assigned a unique ID, called a Party ID. The Party ID is used throughout the system as a foreign key. When a match is determined and records need to be merged, the merge process involves:
• merging physical records (and obsolescing any old records);
• changing all references to obsolete records to the new merged record. Because there are many tables where the ID is used as a foreign key, this is a cumbersome task. (See the MOHSAIC de-duplication process diagram in Appendix A.)
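The merge steps above, re-pointing foreign keys and obsolescing the old record, can be sketched with an in-memory SQLite database; the table and column names are hypothetical, not MOHSAIC's actual schema:

```python
import sqlite3

# Sketch: child records carry a Party ID used as a foreign key elsewhere.
# Merging means re-pointing every reference, then marking the loser obsolete.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE child (party_id INTEGER PRIMARY KEY, name TEXT,
                    obsolete INTEGER DEFAULT 0);
CREATE TABLE immunization (id INTEGER PRIMARY KEY,
                           party_id INTEGER REFERENCES child(party_id),
                           vaccine TEXT);
INSERT INTO child VALUES (1, 'ANA LEE', 0), (2, 'ANA LEE', 0);
INSERT INTO immunization VALUES (10, 1, 'DTaP'), (11, 2, 'MMR');
""")

def merge_children(db, survivor_id, dup_id):
    # In a real system this UPDATE must be repeated for every table that uses
    # party_id as a foreign key -- the cumbersome part noted above.
    db.execute("UPDATE immunization SET party_id=? WHERE party_id=?",
               (survivor_id, dup_id))
    db.execute("UPDATE child SET obsolete=1 WHERE party_id=?", (dup_id,))
    db.commit()

merge_children(db, 1, 2)
rows = db.execute("SELECT party_id FROM immunization ORDER BY id").fetchall()
assert rows == [(1,), (1,)]
```

With one dependent table the merge is trivial; with dozens of tables keyed on the Party ID, the re-pointing step dominates the cost of every merge.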

The Department's OIS staff is responsible for the de-duplication process.
Key Issues:

• The current de-duplication tool allows staff to merge records, but not to locate them.
• Several different tools are used in different stages of processing.
• Issues with records in the hold file; the quantity is not yet known.
• De-duplication responsibilities are centralized in OIS.


4.6 Kansas
Project Name: Community Early Childhood Screening & Tracking, an integral part of the Kansas Integrated Public Health System (KIPHS) [11]
Responsible Organization: The Community Early Childhood Screening and Tracking integration project is a project of the Wichita-Sedgwick County Department of Community Health. KIPHS is managed by a partnership between the Kansas Health Foundation (KHF), the Kansas Health Institute (KHI), and the Kansas Association of Local Health Departments (KALHD).
Geographic Area: Wichita-Sedgwick County, Kansas
Annual Birth Cohort: ~6,500
Project Overview: The Community Early Childhood Screening and Tracking project, now in the requirements determination stage, is a software application designed to ensure children receive necessary testing and follow-up. Specifically, it will link data from the immunization registry and the metabolic and hearing screening programs. Public health providers and private providers will have access to the data. The software application will run on a community information system infrastructure. In addition, certain components of the integration project, particularly the immunization tracking portion, will be integrated with the broader Kansas Integrated Public Health System project. Components will be integrated by a variety of integration strategies, including a common relational database architecture, mirroring strategies (primarily state and local immunization registries), and common access routines. The basic design of KIPHS is that of an integrated client encounter system. It is client-centered, utilizing a centralized client registry so that all health programs and services are linked to a common client record. With the exception of WIC, KIPHS has integrated all public health client service provision activities at the local health department level. WIC will be fully integrated at the local level under contracts currently being negotiated. KIPHS is constructing the first local health department-to-state integration module, which will cover all MCH program reporting data from all health departments once the KIPHS implementation is complete in 2002. The specific integration project is in the initial requirements determination stage. The decision to integrate was part of the overall plan for state IS support developed in the mid-1990s.
The initial planning efforts for KIPHS began in the fall of 1991 with the development of a strategic plan for the public health information needs of the Wichita-Sedgwick County Department of Community Health. The plan identified the need for integration of services at the local level while simultaneously fulfilling the reporting and data needs of the state health department. Interaction with KDHE personnel led to the decision to develop a similar plan at the state level prior to developing a new and comprehensive information system in Wichita. KIPHS has resulted in a much greater level of interaction among different programs in the state and county health departments, since the same software supports all programs.


Key Organizational and Staffing Issues:

• Community-based approach, with county-level planning for data integration to report to the state as the driver for state-level planning.
• Vendor leadership role.
• Extensive strategic planning and rigorous requirements setting.
• Common technical architecture among programs.
• Low sustainability due to new public health and funding priorities.

Community of Practice: The first INPHO fellow was assigned to the KIPHS project; Turning Point.
De-duplication: KIPHS is used by the health departments as an on-line encounter system. A routine in KIPHS enables a user to look for a duplicate client. Duplicates are created because the intake clerk does not properly search for an existing record and relies on clients when they say they have never been to the health department before. Often clients do not want the clerk to pull up their records, in most cases because of a balance due. Sometimes the clerk does not correctly verify the client's name. The routine allows the user to combine service records, but does not allow the user to integrate the financial history.
Key Issues:

• KIPHS de-duplication work is not tied to the integration project but would use the de-duplication routine already in the base application.

• The centralized client index relies on the intake person doing a thorough search before adding a new record, but may be compromised by clients presenting misinformation, lax user habits, or the press of workflow.

• Not really tested in the integrated environment due to insufficient deployment.


4.7 Maine
Project Name: Maine Public Health Information System (MPHIS) [11]
Responsible Organization: Maine Department of Human Services, Bureau of Health
Geographic Area: State of Maine
Annual Birth Cohort: 13,720 (1998 preliminary)
Project Overview: The Maine Public Health Information System (MPHIS) will support real-time, web-based public health communication and data transactions with primary care, laboratory, hospital emergency departments, health engineering, and community-based provider agencies (e.g., immunization administration and vaccine ordering, selected health screening and lab test results, infectious disease reporting, restaurant and other facility inspections, and a seamless interface to secure two-way health alert communications). It will also provide current data for the Bureau and its programs for planning, operations and evaluation. MPHIS will feed a public health data warehouse for release to community partners, other state agencies, and the public through a web site and other means. MPHIS will incorporate the National Electronic Disease Surveillance System (NEDSS) base system, with the functionality of Maine's ImmPact Immunization Registry. ImmPact, the Maine and New Hampshire Immunization Registry, began implementation in 1998. It is a state-wide, web-based system that calculates vaccine and preventive health care visit requirements from birth through death, and provides reminder/recall services for immunization and preventive health care visits, client notification of EPSDT eligibility, and end-of-eligibility notices for Medicaid clients approaching 21 years old. MPHIS will be fully accessible to and used by all medical providers, health care facilities, and community health agencies (as appropriate to their respective functions). NEDSS data will be stored in a data repository or warehouse that will also receive data from other public health databases (such as vital records, the Maine Cancer Registry, and the Maine Behavioral Risk Factor Surveillance System).
The data repository will be accessible to the Bureau of Health (and other state agencies per data sharing agreements) for public health assessment, program planning and evaluation. The data repository will also feed a public web-based community health information system, an independent, stand-alone system that provides up-to-date comprehensive information on health status, quality of care and population-based health outcomes. The MPHIS is expected to be fully designed and in pilot by the end of one year and fully operational within three years. Program development and operations have been categorical, driven by funding, yet focus populations and key internal and external partners overlap. Due to a historical lack of collaborative information system planning, systems within the Department of Human Services and the Bureau of Health, and even within individual programs of the Bureau, were being developed without a larger comprehensive direction. Recent efforts to assess the Bureau's information system needs and capacity, and the rapidly evolving information system technology, have encouraged a vision of integrated public health


information that serves public health officials, medical practitioners, community health agencies, and the public at large. The existing ImmPact Immunization Registry serves as a successful example of a web-based information system developed through a collaborative effort between the Bureau of Health and Maine Medicaid, and one that was easily adopted by local medical practitioners. Other federally driven efforts such as NEDSS and the Health Alert Network (HAN) identify additional goals and objectives of public health information systems, and offer a feasible base system.
Key Organizational Issues:

• Integrated public health information systems concept is relatively new.
• Funding sources have driven development without an overall architecture.
• Collaborating with Medicaid improved provider participation in the immunization system.
Community of Practice: AKC 3; Turning Point (Medicaid case development).
De-duplication: Maine has a semi-automated de-duplication process, using a probabilistic approach in its front-end and back-end processing. OTS products, Name Search and Intelligent Search Technologies, are used to identify the probable matches. This process creates a screen displaying pairs of records whose scores meet a criterion indicating a potential match. The user decides to merge or disassociate each pair of records. Medicaid patient records are automatically merged with existing patients where there is an exact match on first name, last name, middle initial, date of birth and SSN. A detailed discussion of the manual de-duplication is referenced in Appendix A. Incoming records are tagged with an ownership code based on business rules. Some important features of the process include the reliance on Medicaid as the authoritative source of information over a provider record for the same patient. Record date is a key field, and the newer record is presumed to be the better one. Date of birth (DOB) is a key field that is required to be complete and correctly formatted. If duplicate patient records are identified by the results of a patient search, these records can be merged using a series of steps against the back end. Only the ImmPact system administrator, or technical staff designated by the system administrator, should attempt this process. Information that exists in the duplicate patient record but not in the original patient record will be added to the original patient record.
Key Issues:

• Data cleaning and formatting are done prior to the de-duplication process.


• Because of the Medicaid partnership, Medicaid rules for changing data on addresses pre-empt other, possibly better, sources of authoritative data.

• The Immunization Registry and its associated data feeds are subject to this process.
• Other business rules and processes will likely be required for the full integration project.
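The Medicaid auto-merge rule described earlier, an exact five-field match followed by filling in whatever the original record lacks, can be sketched as follows; the field names are illustrative:

```python
# Sketch of Maine's exact-match auto-merge rule: merge only on exact
# agreement of all five key fields, then add to the original record any
# values present in the duplicate but missing from the original.
KEY_FIELDS = ("first", "last", "middle_initial", "dob", "ssn")

def exact_match(a, b):
    """True only when all five key fields agree exactly."""
    return all(a.get(f) == b.get(f) for f in KEY_FIELDS)

def merge_into_original(original, duplicate):
    """Copy into the original only the fields it is missing."""
    for field, value in duplicate.items():
        if not original.get(field):
            original[field] = value
    return original

orig = {"first": "JO", "last": "DOE", "middle_initial": "A",
        "dob": "2000-01-02", "ssn": "000-00-0000", "phone": ""}
dup = dict(orig, phone="555-0101")

assert exact_match(orig, dup)
merged = merge_into_original(orig, dup)
assert merged["phone"] == "555-0101"
```

Because the match must be exact on all five fields, this rule is safe to run automatically; everything else falls back to the scored, human-reviewed path described above.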


4.8 Utah
Project Name: Child Health Advanced Records Management (CHARM)
Responsible Organization: Office of the Chief Information Systems Officer, Utah Department of Health
Geographic Area: State of Utah
Annual Birth Cohort: 47,000
Project Overview: CHARM is integrating the state's immunization registry (USIIS), newborn hearing screening (HiTrack), Vital Statistics, Newborn Screening, Baby Watch and Early Intervention, the Birth Defects Network, Children with Special Health Care Needs, Women, Infants & Children (WIC), the Neonatal Follow-up Program, Medicaid, Child Health Evaluation and Care (Utah's version of Early Periodic Screening, Diagnosis and Treatment, or EPSDT), the Child Health Insurance Program, Lead Screening and other child information systems. However, the current version of CHARM only integrates the first three. CHARM uses a middleware solution to link the operational systems within the participating programs and thus provides services over a virtual "Child Health Profile" database of shared data elements.

In addition to CHARM, some of the participating programs currently share data directly. For example, the Medicaid and child welfare programs share a common intake process that results in a common identifier. Also, the immunization registry and WIC systems import data directly from Vital Statistics to populate their databases.

In 1997, the Utah Department of Health adopted an Information Systems Vision. It called for data to be entered only once, to be complete, uniform, and accurate, to be readily available to authorized users, and to meet users' needs for availability and usefulness. In early 1999, the department's executive leadership made an investment in, and a long-term commitment to, systems integration by hiring a CIO with a clear department-wide integrative mission. That summer, the UDOH formulated and adopted its first department-wide business principle calling for a client-centric way of doing business. In the fall of 1999, a new integrative strategy was formulated during two joint program-IT retreats. This strategy is currently being pursued, and CHARM is one of the five strategic initiatives adopted at that time. Perceived benefits of having different programs working together include enhanced client satisfaction, improved client services, improved multi-problem response, reduced cost, improved assessment, outcome measurement, information for private providers, and improved monitoring of program coverage. Key Organizational and Staffing Issues:

• Continuous proactive leadership: 10-year tenure of the Health Officer
• Adoption of the Information Systems Vision and hiring of a CIO
• Collaboration with Utah State University's Department of Computer Science
• Continuous project updating and quality improvement


Community of Practice: INPHO; Connections; Genetics Planning and Data Integration grant (HRSA), AKC Project Best Practices Source Book; leadership in NAPHSIS.

De-duplication: CHARM doesn't have a central repository of all child information. Instead, it creates the illusion of a shared repository but actually retrieves data from the individual participating programs on demand. The complete set of shared data for a child forms a virtual record that is not stored in any single place. CHARM calls these virtual records Child-Health Profiles (CHPs). For CHARM, the de-duplication problem involves correctly matching child records among the participating programs and removing duplicates by either merging records or linking related records. CHARM addresses the problem by providing a suite of front-end and back-end features and by taking advantage of as much available information about a child as possible.

One front-end feature is the ability for a user in any participating program to search the set of known CHPs for matches to a child being added to the participating program's system. If matches are found (meaning the child was already known to one or more participating programs), the user can choose to have that information populate the local record and to have the resulting record logically linked to that CHP in the integrated system. Users can also choose to create new CHPs for new records in their systems, or merge records as needed. Users can also interactively try to match existing records in a local system with existing CHPs. If matches are found, the user can choose to immediately merge matching records so all the information for a child is logically linked together, or defer that activity to another time or for another person to handle.

CHARM's back-end features include processes for periodically scanning the CHPs for matches. If a cluster of potential matches is found, the system will either process the merge automatically (if the match is certain enough) or record that information for manual resolution.

CHARM's matching algorithm is a rule-based system driven by an easily configurable set of weighted rules. Each rule is made up of some number of weighted comparisons. Each comparison can reference one or more pieces of data (e.g., first name, last name, birth date) and compare them using a specific function. CHARM supports a variety of equality and fuzzy comparisons, such as edit distance. If the sum of all the weighted comparisons for a rule is greater than a specific threshold, then the action of the rule is performed (e.g., the records are determined to be absolute matches, possible matches, or definite non-matches). The comparisons can reference a wide range of fields, including the names of the child, mother, and father; contact information of all kinds, such as address and phone number; birth place, time, weight, multiplicity, and order; medical identifiers; and dates of health-care events. The comparisons can also take advantage of a new Birth Record Number system that has recently been implemented in the Vital Statistics, Newborn Screening, and Newborn Hearing systems. This identifier comes from the newborn kits distributed to hospitals and helps to uniquely identify newborns across the state. Key Issues:

• CHARM provides features for both front-end and back-end de-duplication.
• The user's view of the front-end features depends on how a participating program is integrated with CHARM.
• The CHARM matching algorithm is a rule-based system, but it is highly configurable and can take advantage of a large variety of data fields and comparison functions.
• The CHP includes good identifiers and discriminators, such as the Birth Record Number.
• Matches can be merged immediately or deferred.
• Matches can be merged automatically when they are absolute matches and the merging is straightforward.
• The resolution of questionable matches can be deferred and handled manually.
• CHARM's approach to de-duplication has not yet been proven in operation; it is still awaiting final deployment.
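CHARM's weighted-rule matching might look something like the sketch below. The specific fields, weights, thresholds, and the use of Python's difflib as the fuzzy comparison are illustrative assumptions; the report states only that each rule sums weighted comparisons (including edit-distance-style fuzzy ones) and compares the total against a threshold to classify a pair.

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Similarity in [0, 1]; a stand-in for the edit-distance style
    fuzzy comparisons CHARM supports."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact(a, b):
    """All-or-nothing equality comparison."""
    return 1.0 if a.lower() == b.lower() else 0.0

# A rule is a list of (field, comparison function, weight) triples.
RULE = [
    ("first_name", fuzzy, 0.3),
    ("last_name",  fuzzy, 0.4),
    ("birth_date", exact, 0.3),
]

def score(rec_a, rec_b, rule=RULE):
    """Sum of weighted comparisons over the rule's fields."""
    return sum(w * cmp(rec_a[f], rec_b[f]) for f, cmp, w in rule)

def classify(rec_a, rec_b, match_at=0.9, possible_at=0.7):
    """Map the score to the three outcomes the report describes."""
    s = score(rec_a, rec_b)
    if s >= match_at:
        return "absolute match"
    if s >= possible_at:
        return "possible match"
    return "non-match"
```

Because the rules, weights, and thresholds live in plain data structures, the configuration can be tuned without changing the matching engine, which is the property the report emphasizes.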


5. Observations from Study

There is no question that poor data quality can degrade the value of an information system and even render it useless. A critical part of obtaining and maintaining high-quality data is ensuring that a system contains as little redundant information as possible. This can be particularly challenging for integrated child-health information systems, since the data come from a variety of sources, each with potential quality problems of its own and slight variations in the semantics of common data fields. This study has looked at the de-duplication problem from both a technological and a case-study perspective. Section 5.1 summarizes technical observations from this work that might help current and future integration projects improve their approaches to de-duplication. Section 5.2 lists a number of important issues that go beyond the technology but are critical to the overall success of de-duplication in any integrated child-health information system. Finally, the study uncovered a number of issues that were beyond the scope of this project but would benefit the public health community if they were researched further. These issues are listed in Section 5.3.

5.1 Technical Observations

5.1.1 Overall de-duplication processes and algorithms

No single solution would work for all integrated information systems. There are too many variations in how the systems receive data from participating child-health programs, the structure of that data, the quality of the data from the individual sources, the timing of when the data becomes available, and even the intended uses of the integrated data. So, instead of looking for a canned solution, integration projects should consider the following technical issues and formulate an overall solution that is customized or adapted to their own situations.

1. When will matching occur?
2. What pieces of information among the shared data can best be used to identify potential matches?
3. How will the data be standardized so searching and comparing operations are more effective? Can off-the-shelf software help with this? If so, how?
4. What kind of matching algorithm (multi-field, rule-based, machine-learning, etc.) would be most effective given the type and quality of the available data?
5. How will potential matches be verified (automatically or manually)?
6. How will actual matches be merged or linked?
7. Will the results of the matching and merging be propagated back to the original sources?
8. How will mistakes in matching records be identified and undone?

5.1.2 Level of automation

It is not clear whether front-end or back-end systems are more automated; both involve decision points that require human interaction. By separating the de-duplication problem into data-item cleaning, matching, and merging processes, it is possible to conclude that the processes for standardizing data-item values and for identifying potential duplicates are often more automated than the processes for determining actual duplicates and merging data. Beyond this basic observation, however, the project team noted considerable variance in the degree of automation among the software products reviewed in Section 3 and among the subject systems described in Section 4.

5.1.3 Record Matching

The range of products that deal with record matching was staggering. They differed in how they connect (or don't) to information systems, how they search for potential matches, the number and type of field comparisons they support, the level and type of user interaction, and how they can be customized. Determining which product is best suited for a particular system depends on the specific requirements of that system. Because of this and the huge variance in products, it was impractical to evaluate off-the-shelf products in a general way. Instead, this research provided a framework for conducting such evaluations and a sample evaluation of one product. See Section 3.

The integrated child-health systems reviewed in Section 4 also differ greatly with respect to record matching. For example, New York City's system uses a machine-learning approach, whereas Utah's is rule-based and supports weighted, fuzzy comparisons. Missouri's is also rule-based, but the comparisons are more straightforward. The Oregon and Rhode Island systems take advantage of Soundex technology for some of their field comparisons. Although most of the projects use some kind of scoring or weighting scheme, none of them appear to be using true probabilistic field comparisons, which take into account the frequency of the possible data values in determining the strength of a match. This may be an interesting area for future research. See Section 5.3.

In general, there is insufficient data to conclude whether one matching approach is better than another. In fact, it is not reasonable to make such a comparison, because the differences in the overall approaches and situations make it difficult to come up with a common basis. Instead, to determine the effectiveness of record matching, an integrated system must be prepared to evaluate itself, independent of others, using test data that is representative of conditions found in its real data. See Section 5.3 for ideas on future research with respect to test data.
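For readers unfamiliar with true probabilistic field comparisons, the core idea is that agreement on a rare value (say, an uncommon surname) is stronger evidence of a match than agreement on a common one. A minimal sketch, with made-up data and a simple log-based agreement weight:

```python
import math
from collections import Counter

def frequency_weights(values):
    """Per-value agreement weights derived from observed frequencies:
    the rarer a value, the more evidence its agreement provides."""
    counts = Counter(values)
    total = len(values)
    return {v: math.log2(total / c) for v, c in counts.items()}

# Illustrative surname distribution (not real registry data).
surnames = ["SMITH"] * 50 + ["JONES"] * 30 + ["ABERNATHY"] * 2
weights = frequency_weights(surnames)

# Agreement on a rare surname outweighs agreement on a common one.
assert weights["ABERNATHY"] > weights["SMITH"]
```

A scoring scheme built on such weights would add the weight for each agreeing field instead of a fixed constant, which is what distinguishes this approach from the fixed scoring schemes the projects currently use.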

5.1.4 Sources of information and effective data elements for matching

Below is a list of observations regarding the sources of information and the effectiveness of various data elements in identifying potential matches:

• Most systems consider vital statistics the authoritative source for birth date data, but not for addresses.


• In Maine, where Medicaid owns the authoritative demographic record, there is a problem when other, non-Medicaid information needs to be merged with the Medicaid information: Medicaid does not allow other providers to update records for active Medicaid clients.

• Medicaid and WIC have similar rules that restrict address changes to their respective programs. Addresses cannot be changed or superimposed by another program. This restriction is typically stated in the Memorandum of Understanding (MOU) that allows certain of their data to be shared with other programs.

• Some projects presume that the data with the most current date is the most authoritative. Business rules establishing which date is used (for example, date of last contact, date of entry into the system, or date of last transaction) would make this approach more effective.

• To offset the problem of determining valid addresses, some projects "stack" addresses so that more than one can be used in matching, and so that additional addresses are available for outreach, since many of the target clients are transient. NYC has adopted this approach.

• Other than in Maine, no single program emerged as an authoritative source of demographic information, although Rhode Island is looking at making newborn screening an authoritative source. This is an important result because it indicates that the health systems are very different and that no single approach to de-duplication would work for all of them.

• It is also interesting that the program that sponsors the integration is usually not Vital Statistics, which is the authoritative source for certain key data elements (although there seems to be a tighter coupling with the VS program in Utah).

• Use of VR demographic information in records that are accessible by people outside of public health departments is generally restricted by state privacy laws, but it is often permitted in matching programs as corroboration of information that comes in via another record source, e.g., a provider record. Projects that use this practice may "mark" the VR demographic information to make sure that it is used only for matching and not displayed.

• Many projects use the mother's birth name (maiden name) as a key field, as this is one of the National Vaccine Advisory Committee (NVAC) core data elements required for immunization registries. However, while this can always be obtained from the birth record, it is not consistently used as Mother's Name in other records, particularly in medical or claims systems (unless the mother is still known by her birth name). Some projects find the mother's first name is the field least likely to change and therefore more reliable.
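The address-stacking idea noted in the list above might be modeled as follows. The class and method names are hypothetical; the design point is simply that addresses accumulate rather than overwrite, so any known address can support a match and all remain available for outreach.

```python
class ClientRecord:
    """Keeps every address ever seen for a client ("stacked") instead of
    keeping only the latest one."""

    def __init__(self, name):
        self.name = name
        self.addresses = []          # newest first

    def add_address(self, addr):
        """Record a new address without discarding earlier ones."""
        addr = addr.strip().upper()
        if addr not in self.addresses:
            self.addresses.insert(0, addr)

    def shares_address(self, other):
        """True if any stacked address matches any of the other record's."""
        return bool(set(self.addresses) & set(other.addresses))
```

With this structure, a client who has moved still matches an incoming record that carries an older address.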

5.1.5 Record Merging

All of the projects indicate they have both a front-end and a back-end process. In most cases, the front-end process matches records entering the system, while the back-end processing mainly removes duplicates. Some systems have special screens and design tools to facilitate the record-merge process.
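One common back-end merge policy, and the one Maine's process describes (data present in the duplicate but absent from the original is added to the original, without overwriting), can be sketched as:

```python
def merge_records(original, duplicate):
    """Fill gaps in the original record from the duplicate.
    Existing values in the original are never overwritten; only empty
    or missing fields are filled in. A sketch, not any project's code."""
    merged = dict(original)
    for field, value in duplicate.items():
        if not merged.get(field) and value:
            merged[field] = value
    return merged
```

Other policies are possible (prefer the newer value, prefer the authoritative source), which is one reason merge screens and per-field business rules appear in several of the projects.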


5.1.6 Deployment Timetables

A common theme seen in all of the integration projects is a great underestimation of the time and effort needed to plan and execute de-duplication processes. Almost all of the projects had exceeded their target deployment dates. Where integration projects work only with internal health-department systems, it is easier to control implementation timing. Where external stakeholders are involved, decisions made within their own organizations can affect both the timing and the specification of the de-duplication effort. The deployment of a master-index approach to de-duplication is more heavily impacted by decisions made by individual programs or stakeholders than a more incremental approach that applies de-duplication to specific files and applications. Notably, none of the projects using master-index de-duplication engines was in production at the time of this study, so the efficacy of this approach is not yet known.

5.2 Non-technical Issues

5.2.1 Scope and Organization of the Integration Effort

The integration projects varied considerably in scope. At the time of this study, none of them integrated exactly the same set of programs, and most of them plan on adding programs in the near- or long-term future. Currently, Rhode Island and Oregon's FamilyNet have the highest programmatic involvement, with 9 and 7 programs respectively; Missouri and Maine have 4; New York City and Utah have 3; and Kansas has 2. There is a difference between the level of integration that has been implemented and that which is conceptual or planned. Missouri and Rhode Island are among the most mature integration projects, but these are vastly different state environments. Also, there seems to be more programmatic than technical control in Rhode Island, with the opposite in Missouri. Another way integration projects differ from an organizational perspective is whether the de-duplication activities are centralized or decentralized. New York City, Oregon, and Rhode Island (all original AKC sites) have adopted a Master Client Index approach. Utah also uses a Master Client Index of sorts; its client data is represented with virtual records rather than stored in a single repository. With a centralized approach, there is a potential for the following pitfalls:

• Operations become an "orphan" from a funding or administrative perspective
• Programs may feel that they are losing control over their data

5.2.2 Intended Use of the Integrated Data

Establishing the intended use of the integrated data is important to how de-duplication is approached. Four broad uses are clinical support, case management, program operations, and long-term analysis. For clinical support, the integrated data must be extremely complete and accurate because health-care professionals may use it as a basis for clinical decisions. This intended use may require a level of quality that is beyond what is practical for many systems. Case management and program operations require data that are mostly complete and relatively accurate. Some errors may exist, without life-or-death consequences. A quality level sufficient for these uses should be obtainable for most integrated systems. Data analysis often requires only aggregate or statistical data. A certain degree of error (e.g., duplication) can exist without dramatically affecting the results.

5.2.3 Role of the Immunization Registry

Work on immunization registries has had a significant impact on the integration systems and on de-duplication within those systems. Here are some observations, in no particular order.

• All systems include immunization registries, driven by AKC and CDC funding.
• Only Rhode Island and Missouri initially developed the immunization registry within an integrated system concept.
• For three of the projects (Rhode Island, New York City, and Oregon's ALERT), immunization records are received almost entirely from provider practices. In other areas, there is a large public-health component in which immunizations given in the public sector are primary in the registry.

• Immunization registries were among the first that required public health to establish a regular data exchange with private practices and health plans and to address business rules and quality control policies with external parties.

• The term de-duplication was coined based on the issues confronting immunization projects.

• The All Kids Count conferences on Immunization Registries, initially for their grantees and later expanded to all registry developers, provided a forum for best practices and enlarged the body of knowledge about de-duplication.

• The CDC National Immunization Program Immunization Registry Support team continued to focus on de-duplication by adding it to the Registry Functional Requirements and Core Data Elements (approved by NVAC and required for registry certification).

The CDC has developed a de-duplication testing toolkit with 500 test cases for the testing of de-duplication algorithms in immunization registries.

5.2.4 Role of Vital Records

Below are some observations about the role of Vital Records:

• All of the projects except KS (which is a community-based initiative) include Vital Records.

• Vital Records is the authoritative file for date of birth, but not for addresses; it also provides the population base (denominator) for the project.

• All states have implemented electronic vital records and many are upgrading and re-engineering them.

• It is also interesting that the program that sponsors the integration is usually not Vital Records, which is the authoritative source for certain key data elements (although there seems to be a tighter coupling with the VR program in Utah).

• State law controls who has access to VR data and which data elements may be used or shared, and under what conditions. This is less problematic within the public-health component of health departments, but may be an issue for records the provider may be able to see. VR as a matching file often has to serve as a background verifying activity, not a data source itself.

• Experience with immunization registries highlighted the problem of matching incoming records from provider offices against the VR name and address, because of variations between the birth name and the name actually used, and because of address changes.

• Newborn screening/VR integrations highlight these discrepancies earlier and may lead to the development of better information. (Rhode Island and Utah are looking at this.)

• The birth/infant-death matching process that states perform often provides the gold standard for record matching, but a study performed by the Arkansas DOH indicates that even small changes in the algorithm can affect accuracy. The AR study also indicated differences in state definitions and practices that affect comparability.

• The NAPHSIS project to re-engineer the VR process for all states may contribute to the greater usability of these files as reference data for de-duplication.

5.2.5 Role of Communities of Practice

Participation in communities of practice has helped shape many of the ideas and solutions for de-duplication in the integrated systems. Below are some observations:

• Many of the projects have roots in defined or implicit communities of practice, including CDC's INPHO, All Kids Count, Turning Point, and the Genetics Planning and Data Integration grant (HRSA), an AKC best-practices project.

• In addition to the information sharing that benefited many programs, there have been tool and technique transfers.

• Grant funding allows special projects to be done within a larger undertaking, which might otherwise not be possible.

• Sustainability is at risk with priority and funding changes. Belonging to a community of practice provides high visibility and external support to programs that may get buried in a changing department. Connections site visits served to bolster such projects.

• Communities of practice provide a forum for publicizing and disseminating best practices and research results.

• De-duplication processes will be ongoing as more health information is electronic and at the point of care or use. A continuing forum for the sharing of experience and techniques will be necessary to meet the needs of the varied programs and environments where child health information is integrated.

5.2.6 Program Mandates and Organizational Structure

Obviously, there are many external factors, like program mandates and organizational structure, that can impact an integration project and specifically its de-duplication efforts. Here are some specific observations made in this area during this study.

• Oregon has a legislative mandate for integration; in Utah, Rhode Island, New York City, and Missouri, an executive mandate by the health officer/commissioner establishes the programmatic goals for integration. These include improving program coordination and performance within constituent public agencies and providing better information for program planning and for evaluating program effectiveness.

• However, there is a new customer-based focus to provide a coherent view of the health department to the outside community, particularly to aid families and providers in the care of children in a medical home.

• This requirement shines the spotlight on, and places the greatest burden on, the de-duplication activities, because any failure of data quality is public.

• The organizational structure and the placement of responsibility for programmatic and technical tasks vary among the projects. In some, de-duplication is centralized in the IT organization; in others, it is decentralized to the programmatic components.

• Often, the IT organization is able to use more sophisticated tools and perform multiple iterations of automated processes.

• However, even where data quality assurance and de-duplication resides with the technical organization, programmatic participation is required to establish the business rules and the quality thresholds and to ultimately resolve certain records manually.

• Staff and resources for de-duplication may vary depending on whether legislative budget support is more favorable for programs with constituents or for IT as a general support activity for the Department.

• This may also affect decisions on whether to buy a product, use existing staff to develop software, use freeware, or use manual methods, depending on whether head count, contracts, or just money are the budgetary targets.

• Changes in administration or departmental leadership, policy, and funding have already affected systems integration in NYC and KS. De-duplication is an "un-sexy" but necessary activity for integration that may be threatened by such changes.

5.2.7 Academic Research

Leveraging academic resources has benefited several projects.

• The Utah and New York City de-duplication processes owe much to academic research. ChoiceMaker arose from research at New York University and was later developed into a software product; Utah has an ongoing partnership with Utah State University.

• Other projects have benefited from research funded by HRSA, CDC, and AKC, which contributed to product effectiveness through iterative use and modification of the product on the basis of experience.

• Testing can provide only basic information about a product; using it and working collaboratively with the developer is the best approach but one not always practical for an organization.

5.2.8 Strategic Planning

Strategic planning (both organization-wide and IT-specific) preceded integration activities in most of the projects. These plans were driven primarily by programmatic coordination and service-delivery goals, even where they were developed to support integration not foreseen initially. The best developed of them include a systems architecture that encompasses the systems planned to be linked or integrated into a future system as well as the initial core systems.

78

In some cases, technology goals of improving performance and streamlining processes resulted in the adoption of systems standards addressing the platforms supported, systems development, systems acquisition, and operational procedures. However, even with planning, changing the culture of constituent programs and incorporating legacy systems still challenge projects in their data-quality activities. Finally, strategic planning is important because programmatic requirements and system characteristics identify records where information is linked rather than integrated, which may call for different types of de-duplication strategies.

5.3 Future Study

There is still much that could be done in terms of de-duplication research that would be beneficial to child-health information systems. Below is a list of potential research projects, in no particular order.

5.3.1 Testing and Assessment

A critical success factor for any information system project is the ability to test the system and measure its effectiveness. For de-duplication, this requires

• Meaningful data-quality metrics
• Ways of describing or classifying different kinds of duplicates
• Meaningful test data
• Tools for measuring the data quality of various data sets

The CDC De-duplication Toolkit is a first step in this direction. It provides a small but useful set of test data, a duplicate-classification scheme, and a tool for measuring the number of duplicates remaining in the test data set after de-duplication has been performed. The problem is that this toolkit was built for immunization registries and therefore doesn't fully represent the type of information found in integrated child-health information systems. Also, the data set is relatively small, and the frequency of the errors it contains is based on national statistics; it may therefore not be very representative of the data for any given information system. A future research project could look at creating a new de-duplication toolkit that would provide

• A more robust set of data-quality metrics
• A tool for generating data sets (instead of providing a fixed data set) that are representative of locale-specific data characteristics
• A more robust set of measurement tools

This research project could also review testing strategies and methods, as well as provide insight into how to manage testing activities in general.
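A test-data generator of the kind proposed above might start from clean records and inject labeled errors, so that a de-duplication run can be scored against known ground truth. A minimal sketch: the error types and the uniform choice among them are illustrative, whereas a real toolkit would draw error types and rates from locale-specific statistics.

```python
import random

def make_duplicate(record, rng=random.Random(42)):
    """Derive a labeled duplicate of a clean record by injecting one
    typical data-entry error. Returns (duplicate, error_label) so test
    harnesses know which kind of duplicate each pair represents."""
    dup = dict(record)
    error = rng.choice(["typo", "swap_names", "drop_middle"])
    if error == "typo" and len(dup["first_name"]) > 1:
        # Transpose two adjacent characters in the first name.
        i = rng.randrange(len(dup["first_name"]) - 1)
        s = dup["first_name"]
        dup["first_name"] = s[:i] + s[i + 1] + s[i] + s[i + 2:]
    elif error == "swap_names":
        # First and last name entered in the wrong fields.
        dup["first_name"], dup["last_name"] = dup["last_name"], dup["first_name"]
    else:
        # Middle initial omitted at data entry.
        dup["middle_initial"] = ""
    return dup, error
```

Generating pairs this way yields a data set where the duplicate classification scheme is known in advance, which is exactly what is needed to measure how many duplicates of each kind a given algorithm catches.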


5.3.2 Useful Data Elements and Types of Comparisons

More research could be done on the question of which data elements are of the most value in the matching process and what types of comparisons are the most practical and effective. A future research project could experiment with different data elements and look at some of the more sophisticated matching techniques, such as true probabilistic field comparisons.

5.3.3 Impact of Privacy Issues

Another important area that needs considerable attention is how privacy concerns affect record matching and merging. For example, if an integrated system contains two records that may represent the same child, and one of them includes some kind of indicator meaning the child has "opted out" of the system, what consequences does that have for the matching and merging processes? Can that record be matched against others? If it is, and a merge is performed, is the new, combined record flagged as "opted out"? On the surface, these may seem like questions that an integration project simply has to answer for itself. However, the answers have profound consequences: they represent interpretations of confidentiality policies and could establish an undesired precedent. Research into this issue could be of significant value to the public health community.

5.3.4 Birth-Death Matching

Matching birth and death records is a sticky problem, usually solved with a manual process. This type of matching is often considered the gold standard in de-duplicated records. A study in Arkansas indicated that even small changes in the algorithm could affect the accuracy of this matching process. More research is needed to determine how it can be improved.

5.3.5 Organizational Support and Technical Assistance

De-duplication comprises a set of widely used informatics processes. Public health is moving toward more data and file integration, both in child health and through the Public Health Information Network (PHIN). This creates an even greater requirement for effective de-duplication. Ongoing de-duplication research activities, even if performed within an individual organization, will not benefit the public health community unless there is a forum for discussing approaches and findings and for disseminating results and best practices. More likely, an individual organization will not be able to fund and support individual research, much less discussion, and it would not have the benefit of knowing what other organizations might be researching. A future role for the Public Health Informatics Institute would be to provide organizational support and technical assistance for collaborative research on de-duplication, as an extension of this Connections study, within and across communities of public health practice.


References

[1] Berry, M. (2003, March 6). Studies on Deduplication. Meeting Briefs. Rhode Island Department of Health and HLN Consulting, Rhode Island.

[2] Canavan, B. (2002, June 25-27). Presentation on ALERT Immunization Registry. Connections Site Visit.

[3] Coding Accuracy Support System Technical Guide. (2003, January). Last retrieved September 6, 2003, from Address Management, National Customer Support Center, Memphis, TN. Web site: http://ribbs.usps.gov/files/cass/casstech.pdf

[4] Cummings, D. (1988). American English Spelling: An Informal Description. Baltimore: Johns Hopkins University Press.

[5] Deduplication Test Cases. Last retrieved September 24, 2003, from Centers for Disease Control and Prevention. Web site: http://www.cdc.gov/nip/registry/dedup/dedup.htm

[6] DHS/OFH FamilyNet Data Integration Strategic Plan, Version 1.0. (2003, April 9). Merge/Match/Deduplication Requirements, DHS Office of Family Health. Prepared by CSG Professional Services, Inc., 5201 SW Westgate Drive, Suite 208, Portland, Oregon 97221. (503) 292-0859.

[7] Galhardas, H., Florescu, D., and Shasha, D. (2000). An Extensible Framework for Data Cleaning. Retrieved October 18, 2003, from http://citeseer.nj.nec.com/galhardas00extensible.html

[8] Green, S. and Lutz, R. (2002, August). Measuring phonological similarity: The case of personal names. Retrieved June 6, 2003, from Language Analysis Systems, Inc., 2001. Web site: http://ww.las-inc.com/nameinfor/wp_lsa.htm

[9] Laver, J. (1994). Principles of Phonetics. Cambridge: Cambridge University Press.

[10] Lutz, R. and Greene, S. The use of phonological information in automatic name searching. Retrieved June 6, 2003, from Language Analysis Systems, Inc., 2001. Web site: http://www.las-inc.com/extra/whitepapers/LAS_Phonology_White_Paper.pdf

[11] Patman, F. and Shaefer, L. (2002, August). Is Soundex good enough for you? On the hidden risks of Soundex-based name searching. Last retrieved June 6, 2003, from Onomastix/Language Analysis Systems, Inc., 2001. Web site: http://www.las-inc.com/nameinfo/wp_soundex.htm

[12] Project Briefs. Last retrieved October 9, 2003, from http://www.allkidscount.org/loose%20pages/briefs.html

[13] Smith, Craig. (2003). Historical Record Name Authority and Standardization. Master's thesis, Utah State University.

[14] User Manual for Deduplication Evaluation Toolkit. (2002, June). Retrieved September 24, 2003, from Centers for Disease Control and Prevention. Web site: http://www.cdc.gov/nip/registry/dedup/dedupkit.zip

[15] USPS - CASS (Coding Accuracy Support System). Last retrieved September 6, 2003, from United States Postal Service. Web site: http://www.usps.com/ncsc/addressservices/certprograms/cass.htm

[16] USPS Vendors and Licensees. Retrieved September 6, 2003, from United States Postal Service. Web site: http://www.usps.com/ncsc/ziplookup/vendorslicensees.htm


APPENDIX A – Additional Reference Material

Survey Questionnaire
• Questionnaire-2003626.doc

Information from Rhode Island
• Studies on Deduplication performed by Mike Berry of HLN Consulting, LLC pursuant to GSA contract with RIDOH.
• Matching Project Bibliography, RI_MatchingBibliography_1.pdf
• A variation of the Matching Project Bibliography, RI_MatchingBibliography_2.pdf
• A market survey of product vendors, RI_MarketSurvey.pdf

Information from Oregon
• DHS/OFH FamilyNet Data Integration Strategic Plan Merge/Match De-duplication Requirements, DHS Office of Family Health. Prepared by CSG Professional Services, Inc., 5201 SW Westgate Drive, Suite 208, Portland, Oregon 97221, (503) 292-0859, http://www.csgpro.com. April 9, 2003, Version 1.0, OR_MergeMatchRequirementsV1.doc

Information from Maine
• Description of manual de-duplication process, Manual_Dedup_Code.doc

Information from Missouri
• Diagram of de-duplication process, MO_DiagramOfProcess.ppt
• Data Quality and Assurance presentation, MO_DataQualityAndAssurance.ppt

Information from Arkansas Project
• Interview with Doug Murray, Director of Vital Statistics, AR_NotesFromInterviewWithDougMurray-20030626.doc
• Issues in Linking Public Health Information Systems: An Art or Science, AR_IssuesInLinkingSystems.pdf