

Journal of Medical Systems, Vol. 28, No. 4, August 2004 (© 2004)

Improving Geocoding Practices: Evaluation of Geocoding Tools

Duck-Hye Yang,1,2 Lucy Mackey Bilaver,1 Oscar Hayes,1 and Robert Goerge1

1 Chapin Hall Center for Children at the University of Chicago, Chicago, Illinois.
2 To whom correspondence should be addressed at 1313 E. 60th Street, Chicago, Illinois 60637; e-mail: [email protected].

0148-5598/04/0800-0361/0 © 2004 Plenum Publishing Corporation

This study examined the sources of error involved in geocoding by systematically evaluating the strengths and weaknesses of three widely used geocoding tools. We tested them against a random sample of addresses from a state administrative address master file and found considerable variation in the census block geocodes they assigned to addresses. This high variation was mainly attributable to differences in the preprocessing of addresses before geocoding and in the reference street data used for geocoding. Preprocessing includes not only parsing and standardizing, but also correcting addresses against the US Postal Service ZIP+4 Database, the master mailing address database maintained and updated regularly by USPS.

KEY WORDS: geocoding; address matching; evaluation; geocoding tools; census block.

INTRODUCTION

Customer addresses are commonly available, collected on a regular basis by the private and public sectors. By associating them with geographic information available from other sources, geocoding provides a basis for addressing questions applicable to many areas of interest: where a certain population resides, where needs are located, where resources exist, and where to target resources. Once an address is geocoded, community and demographic information (derived from the U.S. Census and other data sources) can be appended to the address based on the geocodes. The geocodes can later be used for accurate maps and spatial statistical analyses.

Recognizing the importance of geocoding, the private sector has already geocoded customer mailing addresses. Recently, as part of an agenda for reducing threats to national health, the U.S. government also launched a geocoding initiative. Healthy People 2010 Objective 23-3 states “increase the proportion of all major national, state, and local health data systems that use geocoding to promote nationwide use of Geographic Information Systems (GIS) at all levels.”(1) A recent call for proposals for prostate cancer geocoding by the Centers for Disease Control and Prevention (CDC)(2) illustrates efforts to facilitate the use of geocoding at the state/local level.

Various geocoding tools are available, ranging from one component of a comprehensive GIS package (e.g., ArcView and MapInfo), through general matching tools that can also be used for geocoding, to specialized geocoding services provided by commercial firms. Despite the popular use of geocoding and the variety of geocoding tools currently in use, few systematic evaluations of those tools have been done, to the best of our knowledge. This study is part of recent efforts to improve geocoding through attention to methodological issues. The focus in geocoding has been shifting from the match rate to matching accuracy. Although increasing the proportion of addresses that are geocoded has been of primary interest,(3) concerns have recently been raised that geocoding is vulnerable to various types of error at multiple stages of the geocoding process.(4–7)

THE GEOCODING PROCESS

The first stage in the process is preprocessing addresses.(8) Geocoding tools have built-in address cleaning functions, including parsing and standardizing. Parsing dissects a typical address into its elements. For example, “100 North Main Street Suite 103 Chicago IL 60601” is parsed into nine elements: house number (100), predirection (North), street name (Main), street suffix (Street), unit type (Suite), unit number (103), city, state, and zipcode. By isolating individual address elements, parsing makes it easier to correct, standardize, and match data because it allows comparison of individual elements rather than long strings. Appropriate parsing of address components is a crucial part of the record matching and linking process.
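The parsing step described above can be illustrated with a minimal sketch. It is not the parser of any tool evaluated here; the function name `parse_address` and the small lookup tables are hypothetical, and the logic assumes a well-ordered address that contains a street suffix.

```python
import re

# Hypothetical lookup tables; a production parser would use far larger ones.
DIRECTIONS = {"NORTH", "SOUTH", "EAST", "WEST", "N", "S", "E", "W"}
SUFFIXES = {"STREET", "ST", "AVENUE", "AVE", "ROAD", "RD", "BOULEVARD", "BLVD"}
UNIT_TYPES = {"SUITE", "STE", "APT", "UNIT", "FL"}

FIELDS = ["house_number", "predirection", "street_name", "street_suffix",
          "unit_type", "unit_number", "city", "state", "zipcode"]


def parse_address(raw):
    """Split a one-line US address into the nine elements named in the text."""
    tokens = raw.upper().replace(",", " ").split()
    parsed = dict.fromkeys(FIELDS)

    # Work from the right: ZIP code first, then the two-letter state.
    if tokens and re.fullmatch(r"\d{5}(-\d{4})?", tokens[-1]):
        parsed["zipcode"] = tokens.pop()
    if tokens and re.fullmatch(r"[A-Z]{2}", tokens[-1]):
        parsed["state"] = tokens.pop()

    # Work from the left: house number, then an optional predirection.
    if tokens and tokens[0].isdigit():
        parsed["house_number"] = tokens.pop(0)
    if tokens and tokens[0] in DIRECTIONS:
        parsed["predirection"] = tokens.pop(0)

    # The street name runs up to the first suffix or unit-type token.
    name = []
    while tokens and tokens[0] not in SUFFIXES and tokens[0] not in UNIT_TYPES:
        name.append(tokens.pop(0))
    parsed["street_name"] = " ".join(name) or None

    if tokens and tokens[0] in SUFFIXES:
        parsed["street_suffix"] = tokens.pop(0)
    if tokens and tokens[0] in UNIT_TYPES:
        parsed["unit_type"] = tokens.pop(0)
        if tokens:
            parsed["unit_number"] = tokens.pop(0)

    # Whatever remains is treated as the city.
    parsed["city"] = " ".join(tokens) or None
    return parsed
```

When the suffix and unit type are both missing, this sketch lets the street name absorb the city, which is one reason real tools rely on pattern rules and city lookup tables rather than token position alone.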

After parsing, address standardization correctly locates such elements as house numbers, street names, PO boxes, apartment numbers, and rural routes. Some addresses list their elements out of order, so a standardization tool must, for example, correctly distinguish apartment numbers from house numbers. It also reduces variant spellings to a consistent form: “North” is standardized as “N” and “Street” as “ST.”
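As a rough sketch of the abbreviation step, the function below rewrites only the predirection and street suffix fields of an already parsed address, so that a street actually named "North" is left alone. The field names and the two abbreviation tables are assumptions made for illustration, not the rule set of any tool discussed here.

```python
# Hypothetical abbreviation tables; real tools ship much longer lists.
DIRECTION_ABBR = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
SUFFIX_ABBR = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD"}


def standardize(parsed):
    """Return a copy of a parsed address with direction and suffix abbreviated."""
    out = dict(parsed)
    for field, table in (("predirection", DIRECTION_ABBR),
                         ("street_suffix", SUFFIX_ABBR)):
        value = out.get(field)
        if value:
            upper = value.upper()
            # Unknown values are upper-cased but otherwise kept as given.
            out[field] = table.get(upper, upper)
    return out
```

Restricting the substitution to specific fields is the design choice worth noting: a blanket search-and-replace over the raw string would also rewrite street names.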

Some geocoding tools also have address correction functionality. They check and correct addresses against the USPS-maintained mailing address master file. Although the USPS master file contains every street number to which mail can be delivered, geocoding tools use a commercially available, condensed version whose records define address ranges.

After the addresses are enhanced through preprocessing, they enter the main stage of geocoding: being matched against reference data. Reference data should have both address information (e.g., house number and street name) and spatial information (e.g., latitude and longitude coordinates). There are various sources for this reference data. When a study area is local, the best source is a local physical planning department that manages local street network data and updates changes such as new roads and new subdivisions. For larger areas, the most popular source of reference data is the Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line file that was developed and is updated by the U.S. Bureau of the Census. Each record in the TIGER reference data represents a street segment with two ranges of addresses that fall along that street segment, one for each side of the street (left or right). When geocoding an address, a tool searches, in a deterministic or probabilistic way, through the street segments in the reference data to find the segment with address elements that most closely match the address.
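To make the segment structure concrete, the sketch below models a street segment with a left and a right address range and looks up the census block for a house number. The `Segment` class, its field names, and the block codes are hypothetical. The sketch also assumes the common convention that odd and even house numbers lie on opposite sides of the street, and it matches street names exactly rather than by score.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One street segment with an address range and a block code per side."""
    street: str
    left_from: int
    left_to: int
    right_from: int
    right_to: int
    left_block: str
    right_block: str


def in_range(number, start, end):
    """True if number lies in the range and shares the parity of its start."""
    lo, hi = min(start, end), max(start, end)
    return lo <= number <= hi and number % 2 == start % 2


def find_block(number, street, segments):
    """Return the block code for the side of the segment holding the address."""
    for seg in segments:
        if seg.street != street:
            continue
        if in_range(number, seg.left_from, seg.left_to):
            return seg.left_block
        if in_range(number, seg.right_from, seg.right_to):
            return seg.right_block
    return None  # no segment on this street covers the house number
```

For a segment with odd numbers 101 to 199 on the left and even numbers 100 to 198 on the right, house number 150 resolves to the right-side block, and 250 is not geocoded at all.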

Different geocoding software packages adopt different strategies for geocoding. There are variations in the data and methods used for the various geocoding stages: parsing, standardization, correcting, and matching. Without determining those variations and their impacts on geocoded results, there is no common basis for comparing the outcomes of one study or project with another. This study is designed to fill that gap. First, we selected three geocoding tools currently in use and compared their data, features, and functions. Then we tested them against real-life addresses. We compared results by examining the variation in standardized addresses, the proportion of addresses that were geocoded or not geocoded, and finally the variation in blocks identified for input addresses. In short, we examined the extent to which geocoding tools differ from each other, and explored the impact of those differences on the variation, if any, in the blocks identified by the three tools.

METHODS

We selected a random sample of 5000 addresses from the Integrated Database on Children’s Services (IDB) in Illinois. The IDB is a relational database that combines administrative data collected by the state public welfare agencies for administrative purposes.(9) The IDB comprises data from the child welfare, juvenile justice, AFDC and TANF (public aid), Medicaid, special education, mental health, and disability systems. Because of the nature of the database, the sample data contains problematic addresses: agency/community center names without specific address information, rural route addresses, nearest street intersections, or postal box numbers. We excluded addresses for residences outside Illinois because our reference data is exclusively for Illinois.

The three software packages chosen for evaluation were the geocoding tools provided by ArcView 8.3, Automatch, and ZP4/Geolytics. Table I summarizes a comparison of their features and outputs. It is followed by a detailed description of each tool.

ArcView 8.3 With StreetMap USA Extension

From the several geocoding options available in ArcView 8.3 (ESRI, Inc., Redlands, CA), we selected ArcCatalog (one component of ArcView) and the StreetMap USA extension (license required). As reference data, we chose TIGER/Line 2002 data processed with a proprietary line straightening process by Geographic Data Technology (GDT). The GDT/TIGER data uses the North American Datum of 1983 (NAD83), with the address elements prestandardized based on the ESRI standardization rules. Because we focused on Illinois, we chose the state’s .edg file as reference data. We lowered the spelling sensitivity to 60 and the minimum matching score to 40. Adjusting the spelling and matching sensitivity downward increases the match rate but also increases the matching error.


Table I. Three Geocoding Methods

                          ArcView 8.3               Automatch                 ZP4 + Geolytics
Software/Data             StreetMap USA,            AutoStan and              ZP4 and Geolytics
                          ArcCatalog, and ArcMap    AutoMatch                 Dataset

Preparation stage
  Parsing                 Inbuilt                   AutoStan used             Inbuilt
  Standardization         Partial standardization   AutoStan used             Inbuilt
  User-interactive        Yes                       No                        Yes
    address parsing/
    standardizing
  Address correction      None                      None                      Against 2002 USPS
    function                                                                  Street Address Database

Matching stage
  Matching strategy       Deterministic approach    Probability record
                                                    linking
  Reference data          Enhanced GDT 2002         Prestandardized 1998      USPS ZIP+4 Database
                          TIGER/Line                TIGER/Line
  Census block/           Subsequently appended     No more step required:    Subsequently merged
    Census tract          doing a spatial join      preappended to the        with Census data
                          in ArcMap                 reference data            using ZIP+4 code
  X and Y values          Geocode to the street     Geocode to the street     ZIP+4 centroid
                          level                     level

Output                    • Standardized addresses  • Standardized addresses  • Corrected addresses
                                                                                w/error types
                          • Match score             • Match score             • zip+4
                          • Longitude and latitude  • Longitude and latitude  • Longitude and latitude
                            of an address             of an address             of ZIP+4 centroid
                          • Census block and tract  • Census block and tract  • Census block and
                            numbers                   numbers                   tract numbers

Although ArcView allows users to interactively change how it standardizes addresses, we did not use this feature, because our primary interest was to examine the way ArcView standardizes addresses and the impact of that standardization on outputs. ArcView does not provide a tool for correcting addresses against the USPS master file. As a result, users cannot validate the input addresses that are not matched against the reference data.

The ArcView geocoding tool uses a deterministic approach as its matching strategy. In the deterministic approach, each element being compared is evaluated and given a score that tells how well it matched. ArcView uses preprogrammed weights for each element of an address, weighting the house number the most and the street suffix the least. The scores are accumulated to generate an overall match score with a maximum of 100. The score is then compared to a user-specified minimum match score. If the score exceeds the minimum, it is a match. If it falls short, it is not a match.
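The scoring logic in the preceding paragraph can be sketched as a weighted sum. The weights below are hypothetical: they only respect the stated ordering (house number highest, street suffix lowest) and the maximum of 100, since the actual preprogrammed ArcView weights are not given here. The sketch also scores each element as all-or-nothing and treats a score equal to the minimum as a match, both of which are simplifying assumptions.

```python
# Hypothetical element weights that sum to 100; not ArcView's actual values.
WEIGHTS = {
    "house_number": 40,
    "street_name": 30,
    "zipcode": 15,
    "predirection": 10,
    "street_suffix": 5,
}


def match_score(input_addr, candidate):
    """Add up the weights of the elements on which the two addresses agree."""
    return sum(weight for field, weight in WEIGHTS.items()
               if input_addr.get(field) == candidate.get(field))


def is_match(score, minimum_match_score):
    """Compare the accumulated score against the user-specified minimum."""
    return score >= minimum_match_score
```

With these weights, a candidate that differs from the input only in its street suffix scores 95, and one that agrees only on ZIP code, predirection, and suffix scores 30, which falls below a minimum match score of 40.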

The outcome files resulting from geocoding include, among others, a shape (geometry) file with point features corresponding to each geocoded address and an attribute table holding the longitude and latitude coordinates along with the original input addresses.

Because the street layer that ArcView uses as reference data does not contain census block information, one more step was taken. We spatially joined the geocoded address points with census block polygons downloaded from the ESRI website. A spatial join with the 1990 census blocks was first made to identify the 1990 census block containing each address point within its boundary. A spatial join with the 2000 census blocks was subsequently made to add the 2000 census block and tract numbers. The final output file contains coordinates, standardized addresses, a matching flag (“M” indicates match, and “U” no match), and a match score (ranging from 100 down to 40).
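A spatial join of points to polygons reduces to a point-in-polygon test for each address. The sketch below uses the standard ray-casting test on plain coordinate lists. It stands in for, and is much simpler than, the ArcMap operation used in the study; the function names and the dictionary inputs are hypothetical, and projections, holes, and points exactly on a boundary are ignored.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray to the right crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, blocks):
    """Map each point id to the id of the first block polygon containing it."""
    joined = {}
    for point_id, (x, y) in points.items():
        joined[point_id] = next(
            (block_id for block_id, polygon in blocks.items()
             if point_in_polygon(x, y, polygon)),
            None,  # the point falls outside every block
        )
    return joined
```

Running the same points against a second set of polygons (for example, 2000 blocks after 1990 blocks) is what allows both vintages of census geography to be attached to one address.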

Automatch

Automatch is software from Matchware, a company founded by Matt Jaro, who in 1985 produced a practical implementation of the probabilistic linkage methodology that had been theorized by two pioneers in the late 1950s.(10,11) Matchware products are no longer available, but they have been integrated into a wide range of data integrity solution systems. The probabilistic linkage methodology can deal with nonexact matches, as is often the case with addresses. The method calculates the relative frequency of each specific value as an estimate of the probability that the value in question occurs in the component. This relative frequency can be used to estimate the probability of two records having the same value. Weights are assigned to individual components of information, and a composite weight is calculated that summarizes the closeness of agreement between two records.
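One common formulation of such weights, usually associated with Fellegi and Sunter, is sketched below. It is offered as an illustration of the general idea and is not claimed to be Automatch's implementation. Here `m` is the assumed probability that a field agrees when two records truly refer to the same address, and `u` is the probability that it agrees by chance; all numeric values in the usage note are hypothetical.

```python
import math


def field_weight(agrees, m, u):
    """Log-likelihood-ratio weight for one field (agreement or disagreement)."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def composite_weight(record_a, record_b, params):
    """Sum the field weights; params maps field name -> (m, u)."""
    return sum(
        field_weight(record_a.get(field) == record_b.get(field), m, u)
        for field, (m, u) in params.items()
    )
```

With `params = {"house_number": (0.95, 0.01), "street_name": (0.9, 0.05)}`, agreement on a rare value such as a house number contributes more weight than agreement on a street name, and a pair is declared a match when its composite weight reaches a chosen cutoff.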

Automatch is particularly useful with very large databases containing millions of stored records. It uses various techniques such as blocking, sorting, a multipass approach, and EM algorithms to reduce the complexity and time of searching for linkable pairs of records and determining whether a given pair of records is matched. Matchware products are designed for general-purpose matching, such as integrating two different sources of data, and they can be used for geocoding.

Two Matchware tools are required for geocoding: AutoStan for parsing and standardizing, and AutoMatch for matching an input address to street reference data. We had a 1998 TIGER/Line file prestandardized by AutoStan so that each road segment had an explicit census block/tract number as well as address ranges associated with each side. Unlike the spatial join involved with ArcView geocoding, address geocoding in Automatch simply involved returning the census block/tract code corresponding to the appropriate side of the identified segment.

In geocoding with Automatch, we had to write a program called a specification file to specify blocking variables (e.g., zip, or street phonetic coding such as Soundex), matching variables, matching probabilities, and minimum matching weights. We took a multipass approach (7 passes), varying the blocking variables for each pass so that actual matches left undetected in previous passes could still be found.
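The multipass blocking idea can be sketched as follows. Each pass groups the reference records by a blocking key, compares an input record only with reference records in the same block, and leaves unmatched inputs for the next pass, which uses a different key. The function `multipass_match`, the record layout, and the key functions in the usage note are all hypothetical; this is not Automatch's specification-file syntax.

```python
from collections import defaultdict


def multipass_match(inputs, reference, passes, is_match):
    """Return {input index: reference index} found over successive passes.

    passes is a list of key functions; each defines one blocking scheme.
    """
    matches = {}
    for key_fn in passes:
        # Index the reference records by this pass's blocking key.
        index = defaultdict(list)
        for ref_id, ref in enumerate(reference):
            index[key_fn(ref)].append(ref_id)

        for in_id, record in enumerate(inputs):
            if in_id in matches:
                continue  # already matched in an earlier pass
            for ref_id in index.get(key_fn(record), []):
                if is_match(record, reference[ref_id]):
                    matches[in_id] = ref_id
                    break
    return matches
```

For example, a first pass might block on ZIP code (`lambda r: r["zip"]`) and a second on the first letter of the street name (`lambda r: r["street"][:1]`), so that an address with a wrong ZIP code, missed in the first pass, can still be compared with the right segment in the second.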

ZP4 and Geolytics Census Data

Our last geocoding method is a combined method quite different from the first two. First of all, two products from different companies were used: ZP4 (Semaphore Co.), and zip+4-level census data (Geolytics Co.). ZP4 is address correction software that has parsing/standardizing functions as well. ZP4 automatically corrects and standardizes every input address against a current and complete ZIP+4 database maintained and updated by the USPS. It uses comprehensive lookup tables, such as a list of city names associated with each zip code.

Each record in the Geolytics ZIP+4 database represents a zip+4 and contains its centroid and its associated census tract and block numbers, among others, for Census 2000. The zip+4 centroid (latitude/longitude coordinates) is calculated by averaging all of the points making up the streets in the zip+4 from the US Census Bureau TIGER 2000 file. Therefore, unlike Automatch and ArcView, the combined method of ZP4 and Geolytics geocodes addresses to the zip+4 level rather than to the street segment level.

In addition to address correction, ZP4 is primarily used to obtain the correct zipcode/zip+4 combination for further processing. This is done by inputting the address, city, state, and zipcode into the program and having ZP4 output the cleaned address, city, zipcode, and zip+4 into new fields. The zipcode/zip+4 combination is then used to join the output file with the Geolytics ZIP+4 data to append census block numbers and the zip+4 centroid. This join was done initially with the original zipcode/zip+4 combination. All data not matched at that stage was then joined via the zip and the first three digits of the zip+4 (zip+3), and a last join was attempted via the zip and the first two digits of the zip+4 (zip+2). Addresses joined via zipcode/zip+3 or zipcode/zip+2 are not geocoded as accurately as those joined via the original zipcode/zip+4 data, although depending on the application this may not present a problem. Additional variables were created to indicate which type of join was performed, and these variables can be used to take a more conservative or more liberal approach to analysis.
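The three-stage join can be sketched as a lookup that falls back from the full zip+4 to progressively shorter prefixes and records which stage succeeded. The function name, the dictionary layout, and the status labels are hypothetical. The text does not say which record is used when several zip+4 records share a prefix, so taking the first one in sorted order is purely an assumption of this sketch.

```python
def cascade_join(zip5, plus4, geolytics):
    """Look up census data by zip+4, then zip+3, then zip+2.

    geolytics maps (zip5, plus4) -> census record. Returns (record, join_type).
    """
    if (zip5, plus4) in geolytics:
        return geolytics[(zip5, plus4)], "zip+4"

    for width, label in ((3, "zip+3"), (2, "zip+2")):
        prefix = plus4[:width]
        candidates = sorted(
            key for key in geolytics
            if key[0] == zip5 and key[1].startswith(prefix)
        )
        if candidates:
            # Assumption: take the first record sharing the prefix.
            return geolytics[candidates[0]], label

    return None, "unmatched"
```

Keeping the join type alongside each result mirrors the indicator variables described above, so an analysis can later be restricted to exact zip+4 joins if a conservative approach is wanted.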

RESULTS

Preliminary results revealed two problems, and some modification of the original input address data was made. First, ArcView did not correctly identify misspelled addresses even when the spelling score was lowered to 40 from the default value of 80. For example, ArcView was not able to geocode addresses with “CHG” instead of “Chicago.” Second, when a house number contained a hyphen (e.g., 1002-41/2), Automatch took the whole string as the house number instead of the first four digits, and was not able to properly geocode those addresses. Further cleanup was done to fix these problems, and the final results are reported below.

The match rate, the percentage of the addresses geocoded, ranged from 82% (N = 3538) for ZP4+Geolytics to 86% (N = 3704) for Automatch to 88% (N = 3780) for ArcView. One reason for the lower match rate of ZP4+Geolytics was that the Geolytics product did not have a comprehensive database of all Illinois zip+4 records produced by the U.S. Postal Service. ZP4 assigned a zip+4 code to 88% (N = 3785) of addresses, but 247 of those addresses were not matched with any record from the Geolytics database, lowering its match rate to 82%. Many of those 247 addresses were rural route addresses, indicating that the Geolytics ZIP+4 database does not cover rural areas as completely as urban areas.

Table II. Percent of Geocoding Disagreement

                         % Not geocoded in
Geocoded in        Automatch    ArcView    ZP4+Geolytics
Automatch             0.0         4.5          10.2
ArcView               6.4         0.0          10.1
ZP4+Geolytics         5.9         3.9           0.0

Table II shows the rate of geocoding disagreement among the three methods, that is, the percentage of addresses that were geocoded by one method but not by another. The lowest disagreement rate involves geocoding by Automatch and ArcView: 167 (4.5%) out of 3704 addresses geocoded by Automatch were not geocoded by ArcView. The two highest disagreement rates involve geocoding by either Automatch or ArcView and ZP4+Geolytics: 376 (10.2%) out of 3704 addresses geocoded by Automatch were not geocoded by ZP4+Geolytics. Similarly, 381 (10.1%) out of 3780 addresses geocoded by ArcView were not geocoded by ZP4+Geolytics.
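The disagreement rate is a set difference expressed as a percentage of the first method's geocoded addresses. The sketch below computes it and, in the usage note, reproduces two cells of Table II from the counts reported in the text. The function name is hypothetical, and the integer ids are synthetic stand-ins for address records, constructed only so that the set sizes match the reported counts.

```python
def disagreement_rate(geocoded_by_a, geocoded_by_b):
    """Percent of addresses geocoded by method A that method B did not geocode."""
    only_a = geocoded_by_a - geocoded_by_b
    return 100 * len(only_a) / len(geocoded_by_a)


# Synthetic ids: 3704 addresses geocoded by Automatch, 3780 by ArcView,
# arranged so that 167 of the Automatch addresses are missing from ArcView.
automatch = set(range(3704))
arcview = set(range(167, 167 + 3780))

rate_automatch_not_arcview = round(disagreement_rate(automatch, arcview), 1)
rate_arcview_not_automatch = round(disagreement_rate(arcview, automatch), 1)
```

Under these counts the two rates come out at 4.5 and 6.4, matching the Automatch/ArcView cells of Table II, and the overlap of 3537 addresses is the N used for that pair in the block comparison.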

The two highest levels of geocoding disagreement, those involving either of the other tools and ZP4+Geolytics, largely reflect the higher match rates of Automatch and ArcView compared with ZP4+Geolytics. Adjusting the cutoff scores for matching upward in Automatch and ArcView would have reduced the level of disagreement. Other potential sources of this disagreement include different standardization and matching techniques, a different level of address correction capacity, and incomplete reference data. For many of the addresses that were not geocoded by ArcView but were geocoded by ZP4+Geolytics, not a single candidate was found in the TIGER data, suggesting a limitation due to incomplete reference data.

Whereas Table II focuses on the match rate, Table III focuses on the accuracy of matching. It compares the extent to which blocks were geocoded identically by the methods. As many as 28% (N = 988) of the addresses that were geocoded by both ArcView and Automatch (N = 3537) did not have identical census blocks. More than half of those with different blocks (51%) also did not have identical census tracts. This pattern also holds true for the addresses (N = 3399) that were geocoded by both ArcView and ZP4+Geolytics. As many as 36% (N = 1215) of those addresses did not have identical census blocks, and about 33% of those with different blocks did not have identical census tracts.

Table III. Percent of Block and Census Tract Disagreement

                                          % With different   % With different   % Of those with different blocks
Geocoded by                        N      block numbers      census tracts      that have different census tracts
Automatch and ArcView (a)        3537          28                 14                          51
ArcView and ZP4+Geolytics (b)    3498          36                 11                          33

(a) For census 1990 defined areas.
(b) For census 2000 defined areas.


A comparison of the distribution of match scores given to addresses with different block numbers indicates that at least one-fourth had discrepancies in more than one element when compared with their best candidate street segments from the reference data. A comparison of standardized addresses with the original ones revealed that zipcodes changed most often: for 20% of addresses with different block numbers and for more than 50% of addresses with different block and tract numbers. Relatively unreliable zipcodes are particularly problematic for Automatch because of its heavy reliance on zipcodes when searching for a match pair. Choosing high match cutoff points in early passes appears to be particularly important in order to avoid the mismatch problem. For example, we set a relatively low cutoff point when using the zipcode as a blocking variable in the first pass. The first pass alone generated as many as 82% of the total matches, so it is highly likely that wrong pairs were accepted as matches in the first pass.

Different standardization also appears to have caused the mismatch problem. An examination of the addresses standardized by ArcView revealed that ArcView did not correctly isolate unit types and unit numbers. For example, “Main 2FL” (2nd floor) was not parsed into its relevant elements (Main, 2, and FL), causing the street name to be “Main 2FL” instead of “Main.” The ArcView parsing tool does not seem to be an intelligent pattern recognition tool, which may explain why this error occurred only when the street suffix (here ST) was missing. When the street suffix was present, the unit type and unit value were correctly recognized. By contrast, ZP4 and Automatch correctly recognized unit types and unit values even when the street suffix was missing.

The ZP4+Geolytics method also appears to be far from perfect. After zip+4 codes were assigned by ZP4, only 70% of the total addresses joined with the Geolytics census data were joined via zip+4. The remaining 30% had to be joined via the first three digits of the zip+4 (zip+3) and finally the first two digits (zip+2), suggesting that its geocoding results may be vulnerable to errors.

Finally, many of the addresses geocoded differently by the methods were rural route addresses, suggesting that better methods not relying on the TIGER reference data might be needed to handle rural route addresses.

Table IV shows that a high percentage of the addresses to which ZP4 failed to assign zip+4 codes (N = 510) were geocoded by ArcView or Automatch; more than half of the addresses were geocoded by each of the other methods (55% by ArcView and 53% by Automatch).

Table IV. Error Types for Addresses That Were Not Successfully Assigned by ZP4 (N = 510)

                               % of addresses geocoded        % of addresses geocoded
                               by ArcView                     by Automatch
Error type reported by ZP4     Number (282)   Percent (55)    Number (269)   Percent (53)
City not found                                                      2             0.7
Street not found                   126            44.7            116            43.1
Address not found                  115            40.9            112            41.6
zip+4 unavailable                   11             3.9             11             4.1
Multiple streets match              30            10.7             28            10.4


The addresses to which ZP4 failed to assign zip+4 codes showed some patterns. The majority were not geocodable because they did not provide mailing addresses. Examples include the names of a shelter, agency, or jail. Some addresses appear to be mobile homes. For some addresses, however, ZP4 did not clean them correctly. For example, ZP4 did not successfully assign a zip+4 code if the address contained a comma or parentheses. Lastly, if ZP4 identified multiple candidate addresses for a given input address, it did not assign a zip+4 code. For example, if the USPS address database lists both a 100 E MAIN ST and a 100 W MAIN ST, then the input 100 MAIN ST causes multiple matches in ZP4. In this case, more than one street segment in the database matches the input address, and there is not enough information to let ZP4 choose among them. This limitation reflects the main purpose of ZP4, a product that is designed mainly to ensure delivery to correct addresses.
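The refusal to choose among multiple candidates can be sketched as below. This only illustrates the behavior described in the paragraph and is not ZP4's actual logic; the function name, record fields, and status strings are hypothetical, and the reference records list individual house numbers for simplicity, whereas real ZIP+4 records define address ranges.

```python
def assign_zip4(input_addr, usps_records):
    """Return (record, status), declining to pick when candidates are ambiguous."""
    def compatible(record):
        if record["house_number"] != input_addr["house_number"]:
            return False
        if record["street_name"] != input_addr["street_name"]:
            return False
        # A missing predirection in the input is compatible with any direction.
        predirection = input_addr.get("predirection")
        return predirection is None or predirection == record.get("predirection")

    candidates = [record for record in usps_records if compatible(record)]
    if len(candidates) == 1:
        return candidates[0], "assigned"
    if not candidates:
        return None, "address not found"
    return None, "multiple streets match"
```

A delivery-oriented tool prefers returning no answer to returning a possibly wrong one, which is the opposite of picking the highest-scoring candidate.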

The comparison with the unmatched ZP4 addresses reveals a low level of geocoding precision by ArcView and Automatch in some circumstances. ArcView and Automatch geocoded addresses even if their house numbers did not exist within a given address range. When there were multiple candidate addresses, ArcView simply picked the one with the highest match score. Therefore, one safeguard for avoiding this problem would be to set the minimum match score at 80 or higher.

CONCLUSION

Our study affirmed recent concerns over the accuracy of geocoding. In terms of the identification of census blocks, our test showed that 28–36% of the addresses geocoded by two methods were not assigned identical blocks. Of those addresses with different blocks, 33–51% were also assigned different census tracts. This finding implies that geocoding with different methods would result in different datasets on which study results and recommendations would be based. To identify the sources of variation, we evaluated three geocoding tools in terms of the way each parses, standardizes, corrects, and matches real-world addresses. The best strategy to improve geocoding would be to use accurate data: the USPS address master file for correcting addresses, and further enhanced street reference data for accurate geographic location information. Our study confirms the importance of efforts to improve the data “infrastructure” that provides a basis for valid GIS studies.

We also identified the weaknesses and strengths of each tool. ArcView is relatively poor at parsing and at handling misspelled street/city names, and it has no address correction function. However, ArcView is the only one of the three that can handle street intersection addresses, and it does not require much preparation for geocoding. On the other hand, ArcView geocoding requires one more step (i.e., a spatial join) to append census information. It also requires a change in map projections for the data involved in the spatial join, and the spatial join takes a considerable amount of time depending on the size of the study area. There is an advantage to doing the spatial join, however. Users can easily link 1990 census information to the 2000 census by appending both based on the coordinates associated with each address point. Finally, ArcView provides seamless connectivity if one wants to proceed to a map display or spatial analysis.

Automatch requires skills in programming and handling data. It is not a user-friendly automatic process. However, it provides flexibility in choosing weights for each address element and customizing the programs. Once its procedure is set up in terms of the programs required for geocoding, it can easily geocode millions of address records in a surprisingly short time. Its parsing/standardizing tool is particularly useful for real-life addresses because it is an intelligent pattern recognition tool. Like ArcView, however, Automatch does not have an address correction function. Furthermore, because Automatch is a sophisticated tool, users can easily be blind to error.
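To make the idea of per-element weights concrete, here is a minimal Python sketch in the general spirit of weighted record matching. It does not reproduce Automatch's algorithm or configuration syntax; the element names, the weight values, and the function name are all hypothetical.

```python
# Hypothetical weights per address element: (agreement, disagreement).
# Larger magnitudes make an element count more toward the total.
WEIGHTS = {
    "house_number": (5.0, -4.0),
    "street_name": (6.0, -5.0),
    "street_type": (1.5, -1.0),
    "zip": (3.0, -2.0),
}

def match_weight(input_addr, reference_addr, weights=WEIGHTS):
    """Sum agreement or disagreement weights over the address elements."""
    total = 0.0
    for element, (agree, disagree) in weights.items():
        a = input_addr.get(element)
        b = reference_addr.get(element)
        if a is None or b is None:
            continue  # a missing element adds nothing either way
        total += agree if a == b else disagree
    return total

input_addr = {"house_number": "150", "street_name": "MAIN",
              "street_type": "ST", "zip": "60600"}
reference = {"house_number": "150", "street_name": "MAIN",
             "street_type": "AVE", "zip": "60600"}

# Three elements agree and street_type disagrees.
print(match_weight(input_addr, reference))
```

Raising the weights on house number and street name relative to street type is one way a user could express that those two elements matter most for a match.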

Unlike the first two tools, the third one does not geocode to the street level. This partly explains why Geolytics does not provide all Zip+4 records that should be associated with addresses cleaned and revised by ZP4 in the early stage. ZP4 does have an address correction function. However, it has one disadvantage, which stems from its primary goal of ensuring delivery to correct addresses: if input addresses are problematic enough to confuse the software with more than one candidate, they are not revised.

Based on our discussion of the comparative weaknesses and strengths of each package, the best strategy for geocoding would be to combine the three methods. The first step is to use ZP4 to correct the input addresses. The enhanced addresses can then be passed to ArcView or Automatch for geocoding to the street level, depending on the user’s preference for user friendliness or flexibility.
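The combined strategy can be sketched as a two-stage pipeline. In the Python sketch below, the `correct` and `geocode` callables are stand-ins for an address-correction step and a street-level geocoder; they are not the ZP4, ArcView, or Automatch interfaces, and the lookup tables and block labels are invented. Flagging uncorrected addresses for review is an assumption of this sketch, not a procedure described in the study.

```python
def geocode_pipeline(raw_address, correct, geocode):
    """Correct an address first, then geocode the corrected form."""
    corrected = correct(raw_address)
    if corrected is None:
        # The correction step could not settle on a single revised address.
        return {"input": raw_address, "corrected": None,
                "geocode": None, "status": "needs_review"}
    block = geocode(corrected)
    status = "geocoded" if block is not None else "unmatched"
    return {"input": raw_address, "corrected": corrected,
            "geocode": block, "status": status}

# Hypothetical stand-ins for the correction database and street reference data.
CORRECTIONS = {"100 n main st": "100 N MAIN ST", "5 elm st": "5 ELM ST"}
STREET_REFERENCE = {"100 N MAIN ST": "BLOCK-001"}

def correct_stub(raw):
    """Look up a standardized address; None when no unique revision exists."""
    return CORRECTIONS.get(raw.strip().lower())

def geocode_stub(address):
    """Look up a block label for a corrected address; None when unmatched."""
    return STREET_REFERENCE.get(address)

for raw in ["100 N Main St", "5 Elm St", "99 Unknown Rd"]:
    print(geocode_pipeline(raw, correct_stub, geocode_stub))
```

Passing the two stages in as functions keeps the choice of street-level geocoder open, which mirrors the choice between user friendliness and flexibility noted above.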

Finally, it is important to note that geocoding tools rely most heavily on house number and street names. Parsing/standardizing tools can greatly enhance the quality of these critical address elements, but improved address collection at the very beginning stage, data entry, will be critical for improving the accuracy of geocoding.

REFERENCES

1. Healthy People 2010, Vol. I, 2nd edn. <http://www.healthypeople.gov/document/tableofcontents.htm> (Accessed on Oct. 2003).

2. CDC Cooperative Agreement Funding Opportunities. <http://www.cdc-cafunding.org/peps/2003peps/pep008.htm> (Accessed on Oct. 2003).

3. McElroy, J. A., Remington, P. L., Trentham-Dietz, A., Robert, S. A., and Newcomb, P. A., Geocoding addresses from a large population-based study: Lessons learned. Epidemiology 14(4):399–407, 2003.

4. Krieger, N., Waterman, P., Lemieux, K., Zierler, S., and Hogan, J. W., On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. Am. J. Public Health 91(7):1114–1116, 2001.

5. Krieger, N., Place, space, and health: GIS and epidemiology. Epidemiology 14(4):384–385, 2003.

6. Bonner, M. R., Han, D., Nie, J., Rogerson, P., Vena, J. E., and Freudenheim, J. L., Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology 14(4):408–412, 2003.

7. Hurley, S. E., Saunders, T. M., Nivas, R., Hertz, A., and Reynolds, P., Post office box addresses: A challenge for geographic information system-based studies. Epidemiology 14(4):386–391, 2003.

8. Cochinwala, M., Dalal, S., Elmagarmid, A. K., and Verykios, V. S., Record Matching: Past, Present and Future. <http://www.cs.nyu.edu/cs/faculty/shasha/papers/verykios085.pdf> (Accessed on Sept. 2003).

9. Goerge, R., Voorhis, V. J., and Lee, B. J., Illinois’s longitudinal and relational child and family research database. Soc. Sci. Comp. Rev. 12(3):351–365, 1994.

10. National Research Council. Record Linkage Techniques-1997. Proc. Int. Workshop Exposition, 1999. <http://www.nap.edu/books/NI000997/html/> (Accessed on Sept. 2003).

11. Jaro, M. A., Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. J. Am. Stat. Assoc. 84(406):414–420, 1989.