28
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Embed Size (px)

Citation preview

Page 1: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Introduction to Record Linking

John M. Abowd and Lars VilhuberApril 2011

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 2: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Overview

• Introduction to record linking What is record linking, what is it not, what is the theory?

• Record linking: Applications and examplesHow do you do it, what do you need, what are the possible complications?

• Examples of record linkingDo it yourself record linking

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 3: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

From Imputing to Linking

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Precision of link

Availability of linked data

“Statistical record linkage”Merge match with imperfect

link variables

“Statistical record linkage”Merge match with imperfect

link variables

“Massively imputed”Common variables/ values, but datasets

can’t be linked

“Massively imputed”Common variables/ values, but datasets

can’t be linked

“Simulated data”No common observations

“Simulated data”No common observations

“Classical”Merge match

by link variable

“Classical”Merge match

by link variable

Page 4: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Definitions of Record Linkage

• “a procedure to find pairs of records in two files that represent the same entity”

• “identify duplicate records within a file”

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 5: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Uses of Record Linkage

• Merging two files for micro-data analysis– CPS base survey to a supplement– SIPP interviews to each other– Merging years of Business Register– Merging two years of CPS– Merging financial info to firm survey

• Updating/unduplicating a survey frame or a electoral list– Based on business lists– Based on tax records

• Disclosure review of potential public use micro-data

Page 6: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Uses of Record Linkage(private-sector applications)

• Merging two files …– Credit scoring– Customer lists after merger– Internal files when consolidating/upgrading software

• Updating/unduplicating junk mail lists– Based on multiple sources of lists

• Disclosure review of potential public use micro-data– Not done… (Netflix case)

Page 7: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Types of Record Linkage• Merging two files for micro-data

analysis– CPS base survey to a supplement– SIPP interviews to each other– Merging years of Business Register– Merging two years of CPS*– Merging financial info to firm survey

• Updating a survey frame or a electoral list– Based on business lists– Based on tax records

• Disclosure review of potential public use micro-data

Deterministic linkage: survey-provided IDs

Probabilistic linkage: imperfect or no IDs

Probabilistic linkage: imperfect or no IDs

Probabilistic linkage: no IDs

Page 8: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Methods of Record Linkage

• Probabilistic record linkage (PBRL)– non-parametric methods– regression-based methods

• Distance-based record linkage (DBRL)– Euclidean distance– Mahalanobis distance– Kernel-based distance

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 9: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Need for Automated Record Linkage

• RA time required for the following matching tasks:– Finding financial records for

• Fortune 100: 200 hours (Abowd, 1989)• 50,000 small businesses: ??? hours

– Identifying miscoded SSNs • on 60,000 wage records: several weeks• on 500 million wage records: ????

– Unduplication of the U.S. Census survey frame (115,904,641 households): ????

– Longitudinally linking the 12 million establishments in the Business Register: ????

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 10: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Basic Definitions and Notation

• Entities • Associated files• Records on files• Matches

• Nonmatches

BbAa ,

)(),( BA

babaM |)(),(

babaU |)(),(

)(),( ba

Page 11: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Comparisons

• Comparison function maps comparison space into some domain:

• Comparison vector

• PBRL: Agreement pattern, finitely many values, typically {0,1}, but can be Reals

• DBRL: distance (scalar)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

)()(: baC

1)dim(,

Page 12: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Linkage Rule

• A linkage rule defines a record pair’s status based on it’s comparison value– Link (L)– Undecided (Clerical, C)– Non-link (N)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

NCL ,,:F

Page 13: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

In a perfect world…LbaMba ))(),(())(),((

and

NbaUba ))(),(())(),((

Page 14: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Linkage Rules Depend on Context

• PBRL:– For matching: rank by agreement ratios, use cutoff

values to classify into {L,C,U}– For disclosure-analysis: rank by agreement ratios,

classify as {L} if true link (M) is among top j pairs• DBRL:

– Rank pairs by distance, link closest pairs

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 15: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Probabilistic Record Linkage

Page 16: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Example Agreement Pattern

• 3 binary comparisons test whether– γ1 pair agrees on last name

– γ2 pair agrees on first name

– γ3 pair agrees on street name

• Simple agreement pattern:γ=(1,0,1)

• Complex agreement pattern:γ=(0.66,0,0.8)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 17: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Conditional Probabilities

• Probability that a record pair has agreement pattern γ given that it is a match [nonmatch]

P(γ|M)P(γ|U)

• Agreement ratioR(γ) = P(γ|M) / P(γ|U)

This ratio will determine the distinguishing power of the comparison γ.

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 18: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Error Rates

• False match: a linked pair that is not a match (type II error)

• False match rate: probability that a designated link (L) is a nonmatch: μ=P(L|U)

• False nonmatch: a nonlinked pair that is a match (type I error)

• False nonmatch rate: probability that a designated nonlink is a match: λ=P(N|M)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 19: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Fundamental Theorem

1. Order the comparison vectors {γj } by R(γ)2. Choose upper Tu and lower Tl cutoff values

for R(γ)3. Linkage rule:

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

elseC

)(ifN

)(ifL

:F

j

ljj

ujj

TR

TR

Page 20: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Fundamental Theorem (cont.)

• Error rates are

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

N

jjj

L

jjj

jj

jj

MPNPMP

UPLPUP

)|()|()|(

)|()|()|(

Page 21: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Fundamental Theorem (3)

• Fellegi & Sunter (JASA, 1969):If the error rates for the elements of the comparison vector are conditionally independent, then given the overall error rates (m,l), the linkage rule F minimizes the probability associated with an agreement pattern g being placed in the clerical review set. (optimal linkage rule)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 22: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Applying the Theory

• The theory holds on any subset of match pairs (blocks)

• Ratio R: matching weight or total agreement weight

• Optimality of decision rule heavily dependent on the probabilities P(γ|M) and P(γ|U)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 23: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Distance-Based Record Linking

Page 24: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Distance-Based Record Linking

• Distance between any pair of records

can be generally defined as

),())(),(( ba

)(),(2)()()(),( 12 BACovBVarAVard

Page 25: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

DBRL: 4 cases• Mahalanobis distance,

known covariance

• Mahalanobis distance, unknown covariance

0),( BACov

),( BACov

Page 26: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

DBRL: 4 cases• Euclidean distance, unstandardized inputs

• Euclidean distance, standardized inputs

)1,0(~)(

~N

FVar

FFF

IBACovBVarAVar ),(2)()(

IBACovBVarAVar )~

,~

(2)~

()~

(

Page 27: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Linkage rules

• Matching: Sort by distance, choose top j pairs as matches

• Disclosure analysis: Sort by distance, identify true matches among top j pairs

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Page 28: Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Acknowledgements• This lecture is based in part on a 2000 and 2004 lecture given by William Winkler, William

Yancey and Edward Porter at the U.S. Census Bureau• Some portions draw on Winkler (1995), “Matching and Record Linkage,” in B.G. Cox et. al.

(ed.), Business Survey Methods, New York, J. Wiley, 355-384.• Some (non-confidential) portions drawn from Abowd, Stinson, Benedetto (2006), “Final

Report to Social Security Administration on the SIPP/SSA/IRS Public Use File Project”

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved