Deterministic Record Linking. University of North Carolina, Chapel Hill Hye-Chung Kum. Example. Exact Match. Approximate Matching I : SSN. Approximate Matching II : DOB. Approximate Matching III : Name. Deterministic Record Linking. Allow for approximate matching - PowerPoint PPT Presentation
Deterministic Record Linking
University of North Carolina, Chapel Hill
Hye-Chung Kum
Example
EISID  ssn           first name  last name  MI  DOB
E1     085-66-9980   Sally       Hill       L   3/4/1999
E2     143-25-9304   Emily       Brown      K   6/2/2004
E3     354-563-2343  Mary        Johnson    G   5/13/1983
E4     532-34-9183   David       Ford       J   10/25/1990

SISID  ssn           first name  last name  MI  DOB
S1     085-66-9980   Sally       Hill       L   3/4/1999
S2     143-52-9304   Emily       Brown      K   6/2/2004
S3     354-563-2343  Mary        Hawkins    J   5/13/1983
S4     532-34-9183   David       Ford       J   10/23/1990
Exact Match
EISID : E1  ssn : 085-66-9980  first name : Sally  last name : Hill  MI : L  DOB : 3/4/1999
SISID : S1  ssn : 085-66-9980  first name : Sally  last name : Hill  MI : L  DOB : 3/4/1999
Approximate Matching I : SSN
EISID : E2  ssn : 143-25-9304  first name : Emily  last name : Brown  MI : K  DOB : 6/2/2004
SISID : S2  ssn : 143-52-9304  first name : Emily  last name : Brown  MI : K  DOB : 6/2/2004
Approximate Matching II : DOB
EISID : E4  ssn : 532-34-9183  first name : David  last name : Ford  MI : J  DOB : 10/25/1990
SISID : S4  ssn : 532-34-9183  first name : David  last name : Ford  MI : J  DOB : 10/23/1990
Approximate Matching III : Name
EISID : E3  ssn : 354-563-2343  first name : Mary  last name : Johnson  MI : G  DOB : 5/13/1983
SISID : S3  ssn : 354-563-2343  first name : Mary  last name : Hawkins  MI : J  DOB : 5/13/1983
Deterministic Record Linking
Allow for approximate matching
Use explicit approximate rules
Pros : can control the linkage process
Con: difficult to implement
Alternative : Probabilistic record linking
– Also approximate matching
– However, uses general rules specified by users
– Based on total probability
– Con: can not control exactly what to consider a match or not
– Pros: can use specialized software
Approximate Matching : DOB
element to element match : date, month, year
Allow for one element difference
Allow for month and day transposed
DOB : one element
dob1 : 10/25/1990
dob2 : 10/23/1990
DOB : transpose
dob1 : 11/7/1995
dob2 : 7/11/1995
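The two DOB rules above can be sketched in a few lines of Python. This is only an illustration, not the code behind the slides: the function names `parse_dob` and `compare_dob` and the returned labels are assumptions, and dates are assumed to be month/day/year strings as in the examples.

```python
from typing import Optional, Tuple


def parse_dob(text: str) -> Tuple[int, int, int]:
    """Split a 'month/day/year' string into integers (format assumed)."""
    month, day, year = (int(part) for part in text.split("/"))
    return month, day, year


def compare_dob(dob1: str, dob2: str) -> Optional[str]:
    """Return 'equal', 'transpose', 'one element', or None (no approx match)."""
    m1, d1, y1 = parse_dob(dob1)
    m2, d2, y2 = parse_dob(dob2)
    if (m1, d1, y1) == (m2, d2, y2):
        return "equal"
    # Month and day swapped, year unchanged.
    if y1 == y2 and m1 == d2 and d1 == m2:
        return "transpose"
    # Exactly one of month, day, year differs.
    n_diff = (m1 != m2) + (d1 != d2) + (y1 != y2)
    if n_diff == 1:
        return "one element"
    return None


print(compare_dob("10/25/1990", "10/23/1990"))
print(compare_dob("11/7/1995", "7/11/1995"))
```

The two calls use the dob1/dob2 pairs from the slides above.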
Approximate Matching : Name
First name soundex match
First name is approx
– one letter different (insert or replace)
– and/or substr
lsound equal or lname approx
– MI=FI
– FI equal
Fsound & Lsound swapped
obs fname kfname mi kmi
1 RUDOLPH RULDOLPH A A
2 ALIJAH ALIYAH M
3 CAROL CAROLYN J
4 ANGELIQUE ANGIE D
5 JOHNNY JOHNNY JR L
6 ZACHARY ZACK L
7 J MICHAEL M
8 ANTON COUDRAY C A
9 ARTHUR AUTHOR R R
10 EDWIN EDDIE
11 GOLDY OWENS A A
Approximate Matching : Name
obs fname kfname mi kmi lname klname
1 RUDOLPH RULDOLPH A A SIMARD SIMARD
2 ALIJAH ALIYAH M FOSS FOSS
3 CAROL CAROLYN J YOUNG YOUNG
4 ANGELIQUE ANGIE D OUELLETTE OUELLETTE
5 JOHNNY JOHNNY JR L MAYO MAYO
6 ZACHARY ZACK L ROGERS ROGERS
7 J MICHAEL M GALLAGHER GALLAGHER
8 ANTON COUDRAY C A CYPRESS CYPRESS
9 ARTHUR AUTHOR R R DAVIS DAVIS
10 EDWIN EDDIE KAHKONE KAHKONE
11 GOLDY OWENS A A OWENS GOLDY
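The name rules rely on three comparisons: a soundex match, a one-letter difference, and a substring test. The sketch below shows one way these could look in Python. It uses the common American Soundex (four characters, zero padded), which is an assumption: the soundex function used for the original linkage may behave differently. The helper names are hypothetical.

```python
# Letter-to-digit table of American Soundex (assumed variant).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character soundex code, e.g. ROBERT -> R163."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


def one_letter_diff(a: str, b: str) -> bool:
    """True if b is a with one letter inserted, deleted, or replaced."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def is_substring(a: str, b: str) -> bool:
    """True if one non-empty name is contained in the other."""
    return bool(a) and bool(b) and a != b and (a in b or b in a)


def name_approx(a: str, b: str) -> bool:
    """Approximate name match: one-letter difference and/or substring."""
    return one_letter_diff(a, b) or is_substring(a, b)


print(soundex("JOSH"), soundex("JOSHUA"))
print(name_approx("RUDOLPH", "RULDOLPH"), name_approx("CAROL", "CAROLYN"))
```

Under these assumptions RUDOLPH/RULDOLPH and ALIJAH/ALIYAH differ by one letter, and CAROL/CAROLYN match on the substring test.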
Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
– 2 : dob approx, fsound equal
– 3 : dob approx, fname approx
– 4 : dob approx, lsound equal, & fsound diff, but MI=FI
– 5 : dob approx, lsound equal, & fsound diff, but FI equal
– 6 : dob approx, lsound and fsound swapped
– 7 : dob approx, lname approx & fsound diff, but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
– 8 : fname approx, lsound equal, and dob diff
– 9 : fname approx, lsound approx, and dob diff
Approximate Matching : SSN
Digit to digit match
Allow for one digit difference
Allow for two digit difference if transposed
SSN : one digit
ssn1 : 532-34-9183
ssn2 : 532-34-8183
SSN : transpose
ssn1 : 143-25-9304
ssn2 : 143-52-9304
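A minimal sketch of the digit-to-digit SSN comparison follows. The function name `compare_ssn` and the returned labels are assumptions made for illustration. The sketch also assumes that "transposed" means two adjacent digits were swapped, which the slides do not state explicitly.

```python
from typing import Optional


def digits_only(ssn: str) -> str:
    """Drop dashes and any other non-digit characters."""
    return "".join(ch for ch in ssn if ch.isdigit())


def compare_ssn(ssn1: str, ssn2: str) -> Optional[str]:
    """Return 'missing', 'equal', 'one digit', 'transpose', or None."""
    a, b = digits_only(ssn1), digits_only(ssn2)
    if not a or not b:
        return "missing"
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if not diffs:
        return "equal"
    if len(diffs) == 1:
        return "one digit"
    if len(diffs) == 2:
        i, j = diffs
        # Assumption: only adjacent swapped digits count as a transposition.
        if j == i + 1 and a[i] == b[j] and a[j] == b[i]:
            return "transpose"
    return None


print(compare_ssn("532-34-9183", "532-34-8183"))
print(compare_ssn("143-25-9304", "143-52-9304"))
```

The two calls use the ssn1/ssn2 pairs from the slides above.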
Match on ndob (dob+fsound)
ssn missing
– 1: lname equal
– 2: lname approx
ssn approx
– 3: lname equal
– 4: lname approx
– 5: lname diff but fname equal
ssn different
– 11 : lname equal
– 12 : lname approx
lname different
– 51: ssn approx
– 52: ssn missing
obs  SSN        kSSN       fname      kfname     lname            klname
1    244572812  .          APPOLONIA  APPOLONIA  GAVINS           GAVINS
2    .          .          ABEL       ABEL       LOMELIGARCIA     LOMELI
3    248511181  .          JOSH       JOSHUA     PHIPPS           PHIPPS
4    243352045  243352054  LENA       LENA       COOPER           COOPER
5    239565518  239565519  MILES      MILES      KNIGHT JR.       KNIGHT
6    245193584  245493584  MARTHA     MARTHA     LYDA             HOPKINS
7    244779182  244778182  AUSTIN     AUSTYN     TERWILLIGER      OMEARA
8    489987513  489987573  ALISIA     ALICE      GRAVES           WATSON
9    239966568  239966578  ANNA       ANAYA      MONTAGUE         BOLDING
10   227691655  227691633  BRITTNEY   BRITTNEY   REVELS           REVELS
11   242339913  239524402  DANIEL     DANIEL     ROBINSON         ROBINSON
12   221864852  225206017  HELEN      HELEN      HALL             HOLLER
13   240212489  222565604  DEBORAH    DEBRA      LEE              LEACH
14   238995019  .          ABIGAHIL   ABIGAHIL   GARCIATREJO      TREJO
15   .          .          APSLEY     APSLEY     CARLYLE          KARYLE
16   .          .          ABIGAIL    ABIGAIL    GENTRY           KING
17   237999685  .          ABIGAIL    ABIGAIL    RODRIGUEZRINCON  HERNANDEZ
18   237998504  .          ABIGAYLE   ABIGAIL    FITZGERALD       HERNANDEZ
Match on name (fname+lname)
ssn missing & dob approx
– 1: MI equal
– 7: MI missing
– 8: MI not equal
ssn approx
– 3: dob equal
– dob approx
   4: one element
   5: transpose
obs  Type  ssn        kssn       dob       kdob      fname     lname
1    4     362201047  326201047  09/06/09  09/06/08  MARION    MONTAGUE
2    4     313416906  313146906  12/09/75  12/09/76  WILLIAM   JOHNSON
3    4     246381056  216381056  07/15/20  07/07/20  WILLIE    GRANT
4    5     238013803  238013830  11/12/14  12/11/14  GLADYS    SOUTHARD
5    5     241191104  241191103  12/08/94  08/12/94  TAYLOR    FORD
6    52    272318863  272318860  09/11/77  .         NICOLE    PARKER
7    52    578111173  578111113  07/07/88  .         ASAJAH    ROSS
8    100   120688146  120688142  01/31/05  10/31/99  PATRICIA  BANEGAS
9    100   133680780  133680798  01/12/88  02/12/88  DANIEL    ANDRONIC
10   100   132769052  132759052  02/27/89  11/15/89  VICTORIA  HORN
link
Put together all links found
Identify indirect duplicates (type2>10000)
– i.e. both EISID1 & EISID2 link to identical SISID1
– Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
– Keep only the first id in data file link
– Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates
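The indirect-duplicate step can be illustrated as follows: when two EIS ids link to the same SIS id, only the first link is kept and the second EIS id is recorded as an indirect duplicate, and likewise on the SIS side. This is a sketch under assumptions. The function name `split_links` is hypothetical, "first" is taken to mean first in input order, and the real process is not shown in the slides.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def split_links(links: List[Pair]) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split (eisid, sisid) links into unique links and indirect duplicates.

    Returns (unique_links, dupeis2, dupsis2), where dupeis2 holds
    (duplicate eisid, first eisid) and dupsis2 holds (duplicate sisid, first sisid).
    """
    eis_to_sis: Dict[str, str] = {}
    sis_to_eis: Dict[str, str] = {}
    unique_links: List[Pair] = []
    dupeis2: List[Pair] = []
    dupsis2: List[Pair] = []
    for eisid, sisid in links:
        eis_seen = eisid in eis_to_sis
        sis_seen = sisid in sis_to_eis
        if not eis_seen and not sis_seen:
            # First time either id appears: keep the link.
            unique_links.append((eisid, sisid))
            eis_to_sis[eisid] = sisid
            sis_to_eis[sisid] = eisid
        elif sis_seen and not eis_seen:
            # A second EIS id points at an already linked SIS id.
            dupeis2.append((eisid, sis_to_eis[sisid]))
            eis_to_sis[eisid] = sisid
        elif eis_seen and not sis_seen:
            # A second SIS id points at an already linked EIS id.
            dupsis2.append((sisid, eis_to_sis[eisid]))
            sis_to_eis[sisid] = eisid
        # If both ids were already seen, the pair adds nothing new here.
    return unique_links, dupeis2, dupsis2


links = [("E1", "S1"), ("E2", "S1"), ("E3", "S3"), ("E3", "S4")]
unique_links, dupeis2, dupsis2 = split_links(links)
print(unique_links, dupeis2, dupsis2)
```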
Create unique list of EIS & SIS
Generate unique full list of each set of ids
– use linkage info
– Link in the duplicates (dupeis & dupsis)
– TODO : link in the indirect duplicates
– eis & sis
Data flow
[Diagram : Link eis to sis]
eisid.sas7bdat (4,308,863) → ueis.sas7bdat (unduplicated, unique records : 4,277,402, 99%) + dupeis.sas7bdat (duplicates : 31,461)
sisid.sas7bdat (1,888,747) → usis.sas7bdat (unduplicated, unique records : 1,638,112, 87%) + dupsis.sas7bdat (duplicates : 250,635)
ueis.sas7bdat + usis.sas7bdat → link.sas7bdat (1,173,404 : 27% of eis, 72% of sis) + dupeis2.sas7bdat (1,270) + dupsis2.sas7bdat (493)
Output : eis.sas7bdat (4,308,863, 28%) & sis.sas7bdat (1,888,747, 74%)
Type of links
Exact match         Approx match (miss)   Freq     %        cum%
ssn, dob, fsound                          781094   66.57%   66.57%
ssn, fsound         dob                   52173    4.45%    71.01%
ssn                 dob, fsound           10959    0.93%    71.95%
ssn, lsound         fname (dob mismatch)  9320     0.79%    72.74%
ssn                 other                 7095     0.60%    73.35%
dob, fsound, lname  (ssn=.)               251124   21.40%   94.75%
dob, fsound         lname                 16189    1.38%    96.13%
dob, fsound, lname  ssn                   23653    2.02%    98.14%
dob, fsound, lname  (ssn mismatch)        15544    1.32%    99.47%
dob, fsound         other                 4398     0.37%    99.84%
fname, lname        other                 1855     0.16%    100.00%
TOTAL                                     1173404  100.00%
Type of duplicates and links
Type  EIS freq  EIS %    EIS cum %  SIS freq  SIS %    SIS cum %
DLD   3270      0.08%    0.08%      4345      0.23%    0.23%
DLX   8790      0.20%    0.28%      228039    12.07%   12.30%
DXX   19401     0.45%    0.73%      18251     0.97%    13.27%
PLD   3221      0.07%    0.80%      3221      0.17%    13.44%
PLX   8706      0.20%    1.01%      185066    9.80%    23.24%
PXX   19198     0.45%    1.45%      16929     0.90%    24.14%
XLD   185066    4.30%    5.75%      8706      0.46%    24.60%
XLX   976411    22.66%   28.41%     976411    51.70%   76.29%
XXX   3084800   71.59%   100.00%    447779    23.71%   100.00%
TOT   4308863   100.00%             1888747   100.00%
Number of Duplicates
dups  EIS freq  EIS sets  EIS %    EIS cum %  SIS freq  SIS sets  SIS %    SIS cum %
1     4246277   4246277   98.55%   98.55%     1432896   1432896   75.86%   75.86%
2     61600     30800     1.43%    99.98%     338251    169125    17.91%   93.77%
3     942       314       0.02%    100.00%    86379     28793     4.57%    98.35%
4     44        11        0.00%    100.00%    22928     5732      1.21%    99.56%
5                                             6020      1204      0.32%    99.88%
6                                             1662      277       0.09%    99.97%
7                                             497       71        0.03%    99.99%
8                                             96        12        0.01%    100.00%
9                                             18        2         0.00%    100.00%
TOT   4308863   4277402   100.00%             1888747   1638112   100.00%
Implementation details
Ndob & name must be looped
– multiple matches
Too many matches on name
– use half of ssn
– Overlap for transpose
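The "multiple matches" point can be illustrated with a small blocking sketch: records are grouped by the ndob key (dob + fsound), and one EIS record may then pair with several SIS records, each of which has to be checked in turn. The record layout (dicts with `id`, `dob`, `fsound`) and the function name `candidate_pairs` are assumptions for illustration, not the original SAS implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def candidate_pairs(eis: List[Record], sis: List[Record],
                    key_fields: Tuple[str, ...] = ("dob", "fsound")
                    ) -> List[Tuple[str, str]]:
    """Pair every EIS record with every SIS record sharing the same key."""
    index = defaultdict(list)
    for rec in sis:
        key = tuple(rec[f] for f in key_fields)
        if None not in key:  # records with a missing key field are skipped
            index[key].append(rec["id"])
    pairs = []
    for rec in eis:
        key = tuple(rec[f] for f in key_fields)
        if None in key:
            continue
        # One key can return several SIS ids, hence the inner loop.
        for sis_id in index.get(key, []):
            pairs.append((rec["id"], sis_id))
    return pairs


eis = [{"id": "E2", "dob": "6/2/2004", "fsound": "E540"}]
sis = [{"id": "S2", "dob": "6/2/2004", "fsound": "E540"},
       {"id": "S9", "dob": "6/2/2004", "fsound": "E540"},
       {"id": "S7", "dob": None, "fsound": "E540"}]
print(candidate_pairs(eis, sis))
```

Here S9 and S7 are made-up records added only to show a key with more than one match and a record with a missing key field.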
Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
– use linkage info
– Link in the duplicates
– eis & sis
Unduplication
Same as matching between different systems
Except, match the database to itself
– i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
– TODO: for those not linked using primary ID, try with duplicate ID
TODO: explore indirect duplicate links
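A minimal sketch of unduplication as a self-match: records of one database are grouped on a key, one id per group is chosen at random as the primary, and the rest are listed as duplicates of it. For brevity the grouping uses exact key equality, which is a simplification of the approximate rules described earlier. The function name `unduplicate`, the record layout, and the seeded random generator are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def unduplicate(records: List[Record], key_fields: Sequence[str],
                seed: int = 0) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (primary ids, [(duplicate id, primary id), ...])."""
    rng = random.Random(seed)  # seeded so the random choice is repeatable
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec["id"])
    primary_ids: List[str] = []
    duplicates: List[Tuple[str, str]] = []
    for ids in groups.values():
        primary = rng.choice(ids)  # randomly select one id as primary
        primary_ids.append(primary)
        duplicates.extend((dup, primary) for dup in ids if dup != primary)
    return primary_ids, duplicates


sis = [
    {"id": "S1", "ssn": "085-66-9980", "dob": "3/4/1999"},
    {"id": "S5", "ssn": "085-66-9980", "dob": "3/4/1999"},
    {"id": "S2", "ssn": "143-52-9304", "dob": "6/2/2004"},
]
usis, dupsis = unduplicate(sis, ("ssn", "dob"))
print(usis, dupsis)
```

S5 is a made-up second copy of S1, added only so that one group has a duplicate.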
Conclusion
Future work :
– indirect duplicates
– Link using duplicates
SSNs have been changed from the real data
Thank You !
Type of id
first letter:
– P : primary id with duplicates
– D : duplicates (primary info given with prefix ‘l’)
– X : no duplicates
second letter: link status
– L: linked
– X: no linked id
third letter: duplicates status of the linked id
– D: duplicates exist for the linked id
– X: no duplicates for the linked id
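The three-letter type code can be decoded with a small lookup, sketched below. The function name `decode_id_type` and the dictionary keys are hypothetical. The letter meanings are taken from the list above.

```python
from typing import Dict

FIRST = {"P": "primary id with duplicates",
         "D": "duplicate id",
         "X": "no duplicates"}
SECOND = {"L": "linked",
          "X": "no linked id"}
THIRD = {"D": "duplicates exist for the linked id",
         "X": "no duplicates for the linked id"}


def decode_id_type(code: str) -> Dict[str, str]:
    """Translate a type code such as 'XLX' into its three meanings."""
    if len(code) != 3:
        raise ValueError(f"type code must have 3 letters: {code!r}")
    first, second, third = code.upper()
    try:
        return {"id": FIRST[first],
                "link": SECOND[second],
                "linked id": THIRD[third]}
    except KeyError as exc:
        raise ValueError(f"unknown letter in type code: {code!r}") from exc


print(decode_id_type("XLX"))
print(decode_id_type("PLD"))
```

For example, XLX is an id without duplicates that is linked to an id that also has no duplicates.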
EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX) – see next slide
All eis info have no prefix
All sis info have prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
Pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 & kindid1-kindid8
Link type
sdiff : # digits different in ssn
– -1 : one or both ssn are missing
– 2 : two digits are transposed
– 10 : two digits are different but not transposed
ddiff : diff in dob
– -1 : one or both dob are missing
– 2 : date and month are transposed
– 3 : date, month and year are different
– 4 : date and month are different
Fdiff (ldiff) : difference in first (last) name
– -1 : one or both are missing
– 1 : one letter difference (INDEL or REPL)
– 100 : one is a substring of the other
– 101 : one letter diff & substring
Duplicate type
If duplicate id
– Primary id info is given with prefix “l”
– Duplicate type : Lsdiff, lddiff, lfdiff, & lldiff
If primary id
– # of duplicates : freqeis & freqsis
– Duplicate ids : Indid1-indid3 (eis) & kindid1-kindid8 (sis)
Other tables
Link
– Linkage between the primary eis & sis ids
dupeis & dupsis
– List of duplicates with primary id
Data flow
eisid: 4,308,863
– ueis (4,277,402) + dupeis (31,461) : 99%
sisid: 1,888,747
– usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis: 27%, sis: 72%)
– dupeis2 (1,270) + dupsis2 (493)
EIS: 4,308,863 (28%)
SIS: 1,888,747 (74%)
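The counts above can be checked with a few lines of arithmetic. The sketch below reproduces the 99% and 87% unduplication rates and shows that the 27% and 72% link rates are consistent with dividing the number of links by the unduplicated counts (ueis and usis). That choice of denominator is an inference from the numbers, not something the slides state. The 28% and 74% figures are not recomputed here.

```python
counts = {
    "eisid": 4_308_863, "ueis": 4_277_402, "dupeis": 31_461,
    "sisid": 1_888_747, "usis": 1_638_112, "dupsis": 250_635,
    "link": 1_173_404,
}


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Unique records plus duplicates add up to the full id lists.
assert counts["ueis"] + counts["dupeis"] == counts["eisid"]
assert counts["usis"] + counts["dupsis"] == counts["sisid"]

print("ueis / eisid:", pct(counts["ueis"], counts["eisid"]), "%")
print("usis / sisid:", pct(counts["usis"], counts["sisid"]), "%")
print("link / ueis :", pct(counts["link"], counts["ueis"]), "%")
print("link / usis :", pct(counts["link"], counts["usis"]), "%")
```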