Upload
lamnhu
View
271
Download
1
Embed Size (px)
Citation preview
An introduction to fastLink for probabilistic
record linkage
Anders Alexandersson
Florida Cancer Data System
1
Why fastLink?
R package for fast probabilistic record linkage
• Correct: Outperforms deterministic methods
• Fast: Exponentially faster than R package RecordLinkage
• Reproducible: Easily integrated with R (D’oh!) and Stata –unlike Match*Pro
• Measures the linkage quality: Creates “confusion tables” –unlike Match*Pro
3
Syntax
fastLink(dfA, dfB, varnames [, options])
Argument Description
-------- ----------------------------------------
dfA Dataset A - to be matched with Dataset B
dfB Dataset B - to be matched with Dataset A
varnames Names of matching variables
options Optional arguments, e.g., return.all
4
A simple example
> library(fastLink)
> data(samplematch)
>
> fastLink(
+ dfA = dfA, dfB = dfB,
+ varnames = c("firstname", "lastname", "city", "birthyear"),
+ )
====================
fastLink(): Fast Probabilistic Record Linkage
====================
If you set return.all to FALSE, you will not be able to calculate a
confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for zeta parameters.
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Running the EM algorithm.
Getting the indices of estimated matches.
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Deduping the estimated matches.
Getting the match patterns for each estimated match.
$matches
inds.a inds.b
1 64 3
2 153 6
3 475 13
4 460 18
5 85 19
6 351 35
7 210 39
8 195 45
9 207 52
10 90 53
11 33 54
12 193 62
13 392 82
14 68 83
15 18 87
16 63 94
17 118 95
18 436 100
19 206 106
20 277 115
21 96 120
22 77 128
23 344 135
24 335 150
25 341 151
26 444 152
27 308 153
28 133 160
29 160 164
30 402 166
31 440 168
32 468 172
33 114 189
34 97 194
35 137 197
36 226 221
37 425 224
38 135 250
39 493 252
40 15 272
41 236 279
42 389 285
43 495 295
44 78 297
45 303 298
46 12 301
47 357 308
48 79 320
49 270 326
50 343 341
$EM
$zeta.j
[,1]
[1,] 8.241293783702931e-115
[2,] 1.576903714248490e-76
[3,] 1.801356394064965e-109
[4,] 9.084277932577788e-78
[5,] 1.738201790776579e-39
[6,] 1.985613250637094e-72
[7,] 7.196413447391537e-77
[8,] 1.376974464482844e-38
[9,] 1.572969696023866e-71
[10,] 7.932519042474204e-40
[11,] 1.317803124627868e-01
[12,] 9.999698587154621e-01
$p.m
[1] 0.0002879647649894942
$p.u
[1] 0.9997120352350105
$p.gamma.k.m
$p.gamma.k.m[[1]]
[1] 3.856557471491044e-39 1.000000000000000e+00
$p.gamma.k.m[[2]]
[1] 0.007845026008209893 0.992154973991790090
$p.gamma.k.m[[3]]
[1] 6.284580919724891e-39 1.000000000000000e+00
$p.gamma.k.m[[4]]
[1] 7.243393645774638e-39 1.000000000000000e+00
$p.gamma.k.u
$p.gamma.k.u[[1]]
[1] 0.98913054903181008 0.01086945096818984
$p.gamma.k.u[[2]]
[1] 0.9994283982156022539 0.0005716017843976653
$p.gamma.k.u[[3]]
[1] 0.8836944728712028 0.1163055271287971
$p.gamma.k.u[[4]]
[1] 0.98807881759878657 0.01192118240121335
$p.gamma.j.m
[,1]
[1,] 2.469393252522007e-111
[2,] 5.192218690156331e-75
[3,] 3.086998427878787e-109
[4,] 3.582476835073780e-75
[5,] 7.532620882123411e-39
[6,] 4.478468687192776e-73
[7,] 2.601583387690229e-75
[8,] 5.470165546038261e-39
[9,] 3.252249847039596e-73
[10,] 3.774251918520495e-39
[11,] 7.935852798126154e-03
[12,] 9.920641472018739e-01
$p.gamma.j.u
[,1]
[1,] 8.631814480247366e-01
[2,] 9.485386606963669e-03
[3,] 4.936782562997457e-04
[4,] 1.136057241939053e-01
[5,] 1.248398256483840e-03
[6,] 6.497437584421962e-05
[7,] 1.041427114956497e-02
[8,] 1.144410463285825e-04
[9,] 5.956220715256444e-06
[10,] 1.370651348688530e-03
[11,] 1.506190613273500e-05
[12,] 8.614336618534433e-09
$patterns.w
gamma.1 gamma.2 gamma.3 gamma.4 counts weights
1 0 0 0 0 151009 -254.535842491264674
8 2 0 0 0 1666 -166.388717958840118
5 0 2 0 0 89 -242.240949344099363
3 0 0 2 0 19880 -169.242806175054739
10 2 0 2 0 210 -81.095681642630240
7 0 2 2 0 10 -156.947913027889427
2 0 0 0 2 1814 -167.173183532066219
9 2 0 0 2 23 -79.026058999641691
6 0 2 0 2 1 -154.878290384900907
4 0 0 2 2 245 -81.880147215856326
11 2 0 2 2 3 6.266977316568216
12 2 2 2 2 50 18.561870463733534
p.gamma.j.m p.gamma.j.u
1 2.469393252522007e-111 8.631814480247366e-01
8 5.192218690156331e-75 9.485386606963669e-03
5 3.086998427878787e-109 4.936782562997457e-04
3 3.582476835073780e-75 1.136057241939053e-01
10 7.532620882123411e-39 1.248398256483840e-03
7 4.478468687192776e-73 6.497437584421962e-05
2 2.601583387690229e-75 1.041427114956497e-02
9 5.470165546038261e-39 1.144410463285825e-04
6 3.252249847039596e-73 5.956220715256444e-06
4 3.774251918520495e-39 1.370651348688530e-03
11 7.935852798126154e-03 1.506190613273500e-05
12 9.920641472018739e-01 8.614336618534433e-09
$iter.converge
[1] 24
$nobs.a
[1] 500
$nobs.b
[1] 350
$varnames
[1] "firstname" "lastname" "city" "birthyear"
attr(,"class")
[1] "fastLink" "fastLink.EM"
$patterns
gamma.1 gamma.2 gamma.3 gamma.4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 2 2 2 2
10 2 2 2 2
11 2 2 2 2
12 2 2 2 2
13 2 2 2 2
14 2 2 2 2
15 2 2 2 2
16 2 2 2 2
17 2 2 2 2
18 2 2 2 2
19 2 2 2 2
20 2 2 2 2
21 2 2 2 2
22 2 2 2 2
23 2 2 2 2
24 2 2 2 2
25 2 2 2 2
26 2 2 2 2
27 2 2 2 2
28 2 2 2 2
29 2 2 2 2
30 2 2 2 2
31 2 2 2 2
32 2 2 2 2
33 2 2 2 2
34 2 2 2 2
35 2 2 2 2
36 2 2 2 2
37 2 2 2 2
38 2 2 2 2
39 2 2 2 2
40 2 2 2 2
41 2 2 2 2
42 2 2 2 2
43 2 2 2 2
44 2 2 2 2
45 2 2 2 2
46 2 2 2 2
47 2 2 2 2
48 2 2 2 2
49 2 2 2 2
50 2 2 2 2
$posterior
[1] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[5] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[9] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[13] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[17] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[21] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[25] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[29] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[33] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[37] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[41] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[45] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621
[49] 0.9999698587154621 0.9999698587154621
$nobs.a
[1] 500
$nobs.b
[1] 350
attr(,"class")
[1] "fastLink"
5
Correct: Outperforms deterministic methods
Advantage of probabilistic record linkage:
Can use poor-quality matching variables such as address, DOB,gender and name
Disadvantage:
Must estimate match and non-match values (“m- and u-values”)
6
Fast: Exponentially faster than R package RecordLinkage
1) Get Match Patterns
Hashing: Compute match probability for each of the observedagreement patterns (not for all possible agreement patterns)
2) Count Unique Match Patterns
Parallelization: Count the patterns in parallel using OpenMP
3) Get Matched Pairs
Implemented in C++
7
Reproducible: Easily integrated with R and Stata – unlike
Match*Pro
fastLink is a regular R package.
The R package haven loads and saves Stata datasets.
Pre- and post-processing can be done in R or Stata.
Markdown and LaTeX let you combine code with annotations.
This presentation is done in Stata Markdown with the commandmarkstat.
8
Measures the linkage quality: Create a confusion table
> out <- fastLink(
+ dfA = dfA, dfB = dfB,
+ varnames = c("firstname", "lastname", "city", "birthyear"),
+ return.all = TRUE
+ )
====================
fastLink(): Fast Probabilistic Record Linkage
====================
Calculating matches for each variable.
Getting counts for zeta parameters.
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Running the EM algorithm.
Getting the indices of estimated matches.
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Deduping the estimated matches.
Getting the match patterns for each estimated match.
9
Show the confusion table
> confusion(out)
$confusion.table
´True´ Matches ´True´ Non-Matches
Declared Matches 50.0 0.0
Declared Non-Matches 0.3 299.7
$addition.info
results
Max Number of Obs to be Matched 350.00
Sensitivity (%) 99.41
Specificity (%) 100.00
Positive Predicted Value (%) 100.00
Negative Predicted Value (%) 99.90
False Positive Rate (%) 0.00
False Negative Rate (%) 0.59
Correctly Clasified (%) 99.91
F1 Score (%) 99.70
10
Fellegi-Sunter model
Ivan Fellegi and Alan Sunter
“A Theory for Record Linkage”(1969)
Calculate amatch weight =ln(m/u)/ln(2) or anon-match weight =ln((1-m)/(1-u))/ln(2).
Figure 1: Ivan Fellegi
12
fastLink step by step (optional functions)
1) Calculate agreement variable by variable: gammapar*()
2) Count unique agreement patterns: tableCounts()
3) Calculate Fellegi-Sunter weights: emlinkMARmov()
4) Find the matches: matchesLink()
5) Dedupe the matches: dedupeMatches()
13
Meet the developers from Princeton University
• Ted Enamorado; PhD 2018(Expected) in Politics
• Ben Fifield; PhD 2018(Expected) in Politics
• Kosuke Imai; Professor inPolitics and Center forStatistics and MachineLearning
Figure 2: Kosuke Imai
14
TBT: Bayes’ Theorem
Prior Data≠≠≠æ Posterior
Bayes’ theorem (1763) is usedto update our beliefs.
P(A|B) Ã P(A) ú P(B|A)
The posterior is proportional tothe prior times the likelihood.
Figure 3: Thomas Bayes?
15
A complex example: Read Stata datasets in R
The R package haven reads Stata datasets:
> library(haven)
>
> patient_male <- read_dta("patient_male.dta")
> requestor_male <- read_dta("requestor_male.dta")
>
> patient_other <- read_dta("patient_other.dta")
> requestor_other <- read_dta("requestor_other.dta")
16
A complex example: Linkage for females, and hide the output
> hide <- capture.output(out2 <- fastLink(
+ dfA = patient_other, dfB = requestor_other,
+ varnames = c("ssn", "zip", "stnum", "dob", "dblmf", "dblml"),
+ return.all = TRUE))
The R function capture.output() hides the default output.
The variable constructs are SSN + (Address + DOB + Names).
For example, “dblmf” is Double Metaphone of First Name.
17
A complex example: Confusion table for females
There are 658 links (declared matches) for females out of 10,603possible links at the default matching threshold of 0.85 (not shown).The default threshold is for research, not for sevice provision. Thereare 646 links at the higher threshold of 0.98:
> confusion(out2, threshold = 0.98)
$confusion.table
´True´ Matches ´True´ Non-Matches
Declared Matches 645.95 0.05
Declared Non-Matches 9.31 9947.69
$addition.info
results
Max Number of Obs to be Matched 10603.00
Sensitivity (%) 98.58
Specificity (%) 100.00
Positive Predicted Value (%) 99.99
Negative Predicted Value (%) 99.91
False Positive Rate (%) 0.00
False Negative Rate (%) 1.42
Correctly Clasified (%) 99.91
F1 Score (%) 99.28 18
Extract the matches as R dataframes
Slide 5, “A simple example”, ended with:
$matches
inds.a inds.b
1 64 3
2 153 6
There were 50 declared matches (links) at the default matchingthreshold 0.85.
How to extract the matches as R dataframes:
> matchesA_other <- patient_other[out2$matches$inds.a,]
> matchesB_other <- requestor_other[out2$matches$inds.b,]
19
Create variable of matching pattern
The getPatterns() function stores the matching pattern as stringvariable $patterns. In most cases it is not needed to split thestring variable.
> matches_other <- matchesB_other
> matches_other$pattern <- do.call(paste, out2$patterns)
21
Add posterior and pid variables, and subset to matches
The posterior probabilities are stored in $posterior. The datasetshould have the PID variables from both source datasets. Subset tomatches at or above matching threshold 0.98.
> matches_other$posterior <- out2$posterior
> matches_other$patient_id_number_n20 <- matchesA_other$pid
> matches_other <- matches_other[out2$posterior >= .98,
+ c("patient_id_number_n20", "pid", "posterior", "pattern")]
22
Create a combined Stata dataset
Save R dataframe with matched females as Stata dataset:
> save(matches_other, file="matches_other.Rda")
> tmp <-"matches_other.dta"
> write_dta(matches_other, tmp)
Repeat for males (not shown), and then combine in Stata:
. use matches_male, clear
. qui append using matches_other
. qui save matches, replace
23
Determine the “possibles”
Add confusion tables with di�erent matching thresholds (posteriors)to do a sensitivity analysis.
Assume we are interested in posteriors 0.98-0.99 (aka credible range98-99%). Then we can tabulate the matching patterns:
. use matches, clear
. tab pattern if inrange( posterior, 0.98, 0.99 ), mi
pattern Freq. Percent Cum.
0 2 2 0 2 2 272 95.10 95.10
2 0 0 0 0 2 11 3.85 98.95
2 NA NA 0 0 2 2 0.70 99.65
NA 2 NA 2 2 0 1 0.35 100.00
Total 286 100.00
So there are 286 “possibles” which may require clerical review.
24