25
An introduction to fastLink for probabilistic record linkage Anders Alexandersson Florida Cancer Data System 1

An Introduction to Faslink for Probabilistic Record ... · $nobs.a [1] 500 $nobs.b [1] 350 $varnames ... Correctly Clasified (%) 99.91 F1 Score (%) 99.70 10. fastLink details 11

  • Upload
    lamnhu

  • View
    271

  • Download
    1

Embed Size (px)

Citation preview

An introduction to fastLink for probabilistic

record linkage

Anders Alexandersson

Florida Cancer Data System

1

Introduction

2

Why fastLink?

R package for fast probabilistic record linkage

• Correct: Outperforms deterministic methods

• Fast: Exponentially faster than R package RecordLinkage

• Reproducible: Easily integrated with R (D’oh!) and Stata –unlike Match*Pro

• Measures the linkage quality: Creates “confusion tables” –unlike Match*Pro

3

Syntax

fastLink(dfA, dfB, varnames [, options])

Argument Description

-------- ----------------------------------------

dfA Dataset A - to be matched with Dataset B

dfB Dataset B - to be matched with Dataset A

varnames Names of matching variables

options Optional arguments, e.g., return.all

4

A simple example

> library(fastLink)

> data(samplematch)

>

> fastLink(

+ dfA = dfA, dfB = dfB,

+ varnames = c("firstname", "lastname", "city", "birthyear"),

+ )

====================

fastLink(): Fast Probabilistic Record Linkage

====================

If you set return.all to FALSE, you will not be able to calculate a

confusion table as a summary statistic.

Calculating matches for each variable.

Getting counts for zeta parameters.

Parallelizing calculation using OpenMP. 1 threads out of 8 are used.

Running the EM algorithm.

Getting the indices of estimated matches.

Parallelizing calculation using OpenMP. 1 threads out of 8 are used.

Deduping the estimated matches.

Getting the match patterns for each estimated match.

$matches

inds.a inds.b

1 64 3

2 153 6

3 475 13

4 460 18

5 85 19

6 351 35

7 210 39

8 195 45

9 207 52

10 90 53

11 33 54

12 193 62

13 392 82

14 68 83

15 18 87

16 63 94

17 118 95

18 436 100

19 206 106

20 277 115

21 96 120

22 77 128

23 344 135

24 335 150

25 341 151

26 444 152

27 308 153

28 133 160

29 160 164

30 402 166

31 440 168

32 468 172

33 114 189

34 97 194

35 137 197

36 226 221

37 425 224

38 135 250

39 493 252

40 15 272

41 236 279

42 389 285

43 495 295

44 78 297

45 303 298

46 12 301

47 357 308

48 79 320

49 270 326

50 343 341

$EM

$zeta.j

[,1]

[1,] 8.241293783702931e-115

[2,] 1.576903714248490e-76

[3,] 1.801356394064965e-109

[4,] 9.084277932577788e-78

[5,] 1.738201790776579e-39

[6,] 1.985613250637094e-72

[7,] 7.196413447391537e-77

[8,] 1.376974464482844e-38

[9,] 1.572969696023866e-71

[10,] 7.932519042474204e-40

[11,] 1.317803124627868e-01

[12,] 9.999698587154621e-01

$p.m

[1] 0.0002879647649894942

$p.u

[1] 0.9997120352350105

$p.gamma.k.m

$p.gamma.k.m[[1]]

[1] 3.856557471491044e-39 1.000000000000000e+00

$p.gamma.k.m[[2]]

[1] 0.007845026008209893 0.992154973991790090

$p.gamma.k.m[[3]]

[1] 6.284580919724891e-39 1.000000000000000e+00

$p.gamma.k.m[[4]]

[1] 7.243393645774638e-39 1.000000000000000e+00

$p.gamma.k.u

$p.gamma.k.u[[1]]

[1] 0.98913054903181008 0.01086945096818984

$p.gamma.k.u[[2]]

[1] 0.9994283982156022539 0.0005716017843976653

$p.gamma.k.u[[3]]

[1] 0.8836944728712028 0.1163055271287971

$p.gamma.k.u[[4]]

[1] 0.98807881759878657 0.01192118240121335

$p.gamma.j.m

[,1]

[1,] 2.469393252522007e-111

[2,] 5.192218690156331e-75

[3,] 3.086998427878787e-109

[4,] 3.582476835073780e-75

[5,] 7.532620882123411e-39

[6,] 4.478468687192776e-73

[7,] 2.601583387690229e-75

[8,] 5.470165546038261e-39

[9,] 3.252249847039596e-73

[10,] 3.774251918520495e-39

[11,] 7.935852798126154e-03

[12,] 9.920641472018739e-01

$p.gamma.j.u

[,1]

[1,] 8.631814480247366e-01

[2,] 9.485386606963669e-03

[3,] 4.936782562997457e-04

[4,] 1.136057241939053e-01

[5,] 1.248398256483840e-03

[6,] 6.497437584421962e-05

[7,] 1.041427114956497e-02

[8,] 1.144410463285825e-04

[9,] 5.956220715256444e-06

[10,] 1.370651348688530e-03

[11,] 1.506190613273500e-05

[12,] 8.614336618534433e-09

$patterns.w

gamma.1 gamma.2 gamma.3 gamma.4 counts weights

1 0 0 0 0 151009 -254.535842491264674

8 2 0 0 0 1666 -166.388717958840118

5 0 2 0 0 89 -242.240949344099363

3 0 0 2 0 19880 -169.242806175054739

10 2 0 2 0 210 -81.095681642630240

7 0 2 2 0 10 -156.947913027889427

2 0 0 0 2 1814 -167.173183532066219

9 2 0 0 2 23 -79.026058999641691

6 0 2 0 2 1 -154.878290384900907

4 0 0 2 2 245 -81.880147215856326

11 2 0 2 2 3 6.266977316568216

12 2 2 2 2 50 18.561870463733534

p.gamma.j.m p.gamma.j.u

1 2.469393252522007e-111 8.631814480247366e-01

8 5.192218690156331e-75 9.485386606963669e-03

5 3.086998427878787e-109 4.936782562997457e-04

3 3.582476835073780e-75 1.136057241939053e-01

10 7.532620882123411e-39 1.248398256483840e-03

7 4.478468687192776e-73 6.497437584421962e-05

2 2.601583387690229e-75 1.041427114956497e-02

9 5.470165546038261e-39 1.144410463285825e-04

6 3.252249847039596e-73 5.956220715256444e-06

4 3.774251918520495e-39 1.370651348688530e-03

11 7.935852798126154e-03 1.506190613273500e-05

12 9.920641472018739e-01 8.614336618534433e-09

$iter.converge

[1] 24

$nobs.a

[1] 500

$nobs.b

[1] 350

$varnames

[1] "firstname" "lastname" "city" "birthyear"

attr(,"class")

[1] "fastLink" "fastLink.EM"

$patterns

gamma.1 gamma.2 gamma.3 gamma.4

1 2 2 2 2

2 2 2 2 2

3 2 2 2 2

4 2 2 2 2

5 2 2 2 2

6 2 2 2 2

7 2 2 2 2

8 2 2 2 2

9 2 2 2 2

10 2 2 2 2

11 2 2 2 2

12 2 2 2 2

13 2 2 2 2

14 2 2 2 2

15 2 2 2 2

16 2 2 2 2

17 2 2 2 2

18 2 2 2 2

19 2 2 2 2

20 2 2 2 2

21 2 2 2 2

22 2 2 2 2

23 2 2 2 2

24 2 2 2 2

25 2 2 2 2

26 2 2 2 2

27 2 2 2 2

28 2 2 2 2

29 2 2 2 2

30 2 2 2 2

31 2 2 2 2

32 2 2 2 2

33 2 2 2 2

34 2 2 2 2

35 2 2 2 2

36 2 2 2 2

37 2 2 2 2

38 2 2 2 2

39 2 2 2 2

40 2 2 2 2

41 2 2 2 2

42 2 2 2 2

43 2 2 2 2

44 2 2 2 2

45 2 2 2 2

46 2 2 2 2

47 2 2 2 2

48 2 2 2 2

49 2 2 2 2

50 2 2 2 2

$posterior

[1] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[5] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[9] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[13] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[17] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[21] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[25] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[29] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[33] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[37] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[41] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[45] 0.9999698587154621 0.9999698587154621 0.9999698587154621 0.9999698587154621

[49] 0.9999698587154621 0.9999698587154621

$nobs.a

[1] 500

$nobs.b

[1] 350

attr(,"class")

[1] "fastLink"

5

Correct: Outperforms deterministic methods

Advantage of probabilistic record linkage:

Can use poor-quality matching variables such as address, DOB,gender and name

Disadvantage:

Must estimate match and non-match values (“m- and u-values”)

6

Fast: Exponentially faster than R package RecordLinkage

1) Get Match Patterns

Hashing: Compute match probability for each of the observedagreement patterns (not for all possible agreement patterns)

2) Count Unique Match Patterns

Parallelization: Count the patterns in parallel using OpenMP

3) Get Matched Pairs

Implemented in C++

7

Reproducible: Easily integrated with R and Stata – unlike

Match*Pro

fastLink is a regular R package.

The R package haven loads and saves Stata datasets.

Pre- and post-processing can be done in R or Stata.

Markdown and LaTeX let you combine code with annotations.

This presentation is done in Stata Markdown with the commandmarkstat.

8

Measures the linkage quality: Create a confusion table

> out <- fastLink(

+ dfA = dfA, dfB = dfB,

+ varnames = c("firstname", "lastname", "city", "birthyear"),

+ return.all = TRUE

+ )

====================

fastLink(): Fast Probabilistic Record Linkage

====================

Calculating matches for each variable.

Getting counts for zeta parameters.

Parallelizing calculation using OpenMP. 1 threads out of 8 are used.

Running the EM algorithm.

Getting the indices of estimated matches.

Parallelizing calculation using OpenMP. 1 threads out of 8 are used.

Deduping the estimated matches.

Getting the match patterns for each estimated match.

9

Show the confusion table

> confusion(out)

$confusion.table

´True´ Matches ´True´ Non-Matches

Declared Matches 50.0 0.0

Declared Non-Matches 0.3 299.7

$addition.info

results

Max Number of Obs to be Matched 350.00

Sensitivity (%) 99.41

Specificity (%) 100.00

Positive Predicted Value (%) 100.00

Negative Predicted Value (%) 99.90

False Positive Rate (%) 0.00

False Negative Rate (%) 0.59

Correctly Clasified (%) 99.91

F1 Score (%) 99.70

10

fastLink details

11

Fellegi-Sunter model

Ivan Fellegi and Alan Sunter

“A Theory for Record Linkage”(1969)

Calculate amatch weight =ln(m/u)/ln(2) or anon-match weight =ln((1-m)/(1-u))/ln(2).

Figure 1: Ivan Fellegi

12

fastLink step by step (optional functions)

1) Calculate agreement variable by variable: gammapar*()

2) Count unique agreement patterns: tableCounts()

3) Calculate Fellegi-Sunter weights: emlinkMARmov()

4) Find the matches: matchesLink()

5) Dedupe the matches: dedupeMatches()

13

Meet the developers from Princeton University

• Ted Enamorado; PhD 2018(Expected) in Politics

• Ben Fifield; PhD 2018(Expected) in Politics

• Kosuke Imai; Professor inPolitics and Center forStatistics and MachineLearning

Figure 2: Kosuke Imai

14

TBT: Bayes’ Theorem

Prior Data≠≠≠æ Posterior

Bayes’ theorem (1763) is usedto update our beliefs.

P(A|B) Ã P(A) ú P(B|A)

The posterior is proportional tothe prior times the likelihood.

Figure 3: Thomas Bayes?

15

A complex example: Read Stata datasets in R

The R package haven reads Stata datasets:

> library(haven)

>

> patient_male <- read_dta("patient_male.dta")

> requestor_male <- read_dta("requestor_male.dta")

>

> patient_other <- read_dta("patient_other.dta")

> requestor_other <- read_dta("requestor_other.dta")

16

A complex example: Linkage for females, and hide the output

> hide <- capture.output(out2 <- fastLink(

+ dfA = patient_other, dfB = requestor_other,

+ varnames = c("ssn", "zip", "stnum", "dob", "dblmf", "dblml"),

+ return.all = TRUE))

The R function capture.output() hides the default output.

The variable constructs are SSN + (Address + DOB + Names).

For example, “dblmf” is Double Metaphone of First Name.

17

A complex example: Confusion table for females

There are 658 links (declared matches) for females out of 10,603possible links at the default matching threshold of 0.85 (not shown).The default threshold is for research, not for sevice provision. Thereare 646 links at the higher threshold of 0.98:

> confusion(out2, threshold = 0.98)

$confusion.table

´True´ Matches ´True´ Non-Matches

Declared Matches 645.95 0.05

Declared Non-Matches 9.31 9947.69

$addition.info

results

Max Number of Obs to be Matched 10603.00

Sensitivity (%) 98.58

Specificity (%) 100.00

Positive Predicted Value (%) 99.99

Negative Predicted Value (%) 99.91

False Positive Rate (%) 0.00

False Negative Rate (%) 1.42

Correctly Clasified (%) 99.91

F1 Score (%) 99.28 18

Extract the matches as R dataframes

Slide 5, “A simple example”, ended with:

$matches

inds.a inds.b

1 64 3

2 153 6

There were 50 declared matches (links) at the default matchingthreshold 0.85.

How to extract the matches as R dataframes:

> matchesA_other <- patient_other[out2$matches$inds.a,]

> matchesB_other <- requestor_other[out2$matches$inds.b,]

19

Post-processing

20

Create variable of matching pattern

The getPatterns() function stores the matching pattern as stringvariable $patterns. In most cases it is not needed to split thestring variable.

> matches_other <- matchesB_other

> matches_other$pattern <- do.call(paste, out2$patterns)

21

Add posterior and pid variables, and subset to matches

The posterior probabilities are stored in $posterior. The datasetshould have the PID variables from both source datasets. Subset tomatches at or above matching threshold 0.98.

> matches_other$posterior <- out2$posterior

> matches_other$patient_id_number_n20 <- matchesA_other$pid

> matches_other <- matches_other[out2$posterior >= .98,

+ c("patient_id_number_n20", "pid", "posterior", "pattern")]

22

Create a combined Stata dataset

Save R dataframe with matched females as Stata dataset:

> save(matches_other, file="matches_other.Rda")

> tmp <-"matches_other.dta"

> write_dta(matches_other, tmp)

Repeat for males (not shown), and then combine in Stata:

. use matches_male, clear

. qui append using matches_other

. qui save matches, replace

23

Determine the “possibles”

Add confusion tables with di�erent matching thresholds (posteriors)to do a sensitivity analysis.

Assume we are interested in posteriors 0.98-0.99 (aka credible range98-99%). Then we can tabulate the matching patterns:

. use matches, clear

. tab pattern if inrange( posterior, 0.98, 0.99 ), mi

pattern Freq. Percent Cum.

0 2 2 0 2 2 272 95.10 95.10

2 0 0 0 0 2 11 3.85 98.95

2 NA NA 0 0 2 2 0.70 99.65

NA 2 NA 2 2 0 1 0.35 100.00

Total 286 100.00

So there are 286 “possibles” which may require clerical review.

24

TODO

• Vignette with many practical examples

• Aggregated confusion tables

• Better blocking features

25