An Automated Record Linkage System for the Canadian Census, 1871-1881

L. Antonie (University of Guelph), P. Baskerville (Universities of Alberta and Victoria), K. Inwood (University of Guelph), J. A. Ross (University of Guelph)

Record Linkage Workshop, May 24th-25th, 2010, University of Guelph

What we are working towards

• ‘Unbiased’ links connecting individuals/households over several census years

• A comprehensive infrastructure of longitudinal data

[Diagram: links planned across the Canadian censuses of 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916, and the US censuses of 1880 and 1900]

Current Work

[Diagram: automatic linking between 100% of the 1871 Census (3,601,663 records) and 100% of the 1881 Census (4,277,807 records)]

Partners and collaborators: FamilySearch, Church of Latter Day Saints, Minnesota Population Center, Université de Montréal, University of Alberta

Existing (True) Links

• Ontario Industrial Proprietors – 8429 links

• Logan Township – 1760 links

• St. James Church, Toronto – 232 links

• Quebec City Boys – 1403 links

• Bias

– family context

– others?

[Map labels: Logan Twp, Guelph]

Attributes for Automatic Linking

• Last Name - string

• First Name - string

• Gender – binary

• Age - number

• Birthplace - number

• Marital status – single, married, divorced, widowed, unknown

Automatic Linkage

• The challenges:

1) Identify the same person

2) Deal with attribute characteristics

3) Manage computational expense

• The system:

Data Cleaning and Standardization

• Cleaning

– Names – remove non-alphanumeric characters; remove titles

– Age – transform non-numerical representations (e.g. 3 months) to the corresponding numbers

– All attributes – deal with English/French notations (e.g. days/jours, married/mariee)

• Standardization

– Birthplace codes and granularity

– Marital status
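The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the title list, the unit table, the marital-status vocabulary and the function names are all assumptions, since the slides do not enumerate them.

```python
import re
import unicodedata

# Hypothetical lookup tables; the slides do not list the real ones.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir", "mme", "mlle"}
UNIT_IN_YEARS = {
    "day": 1 / 365, "days": 1 / 365, "jour": 1 / 365, "jours": 1 / 365,
    "week": 1 / 52, "weeks": 1 / 52, "semaine": 1 / 52, "semaines": 1 / 52,
    "month": 1 / 12, "months": 1 / 12, "mois": 1 / 12,
}
YEAR_UNITS = {"", "year", "years", "an", "ans"}
MARITAL = {
    "married": "married", "marie": "married", "mariee": "married",
    "single": "single", "celibataire": "single",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
}


def clean_name(raw):
    """Delete non-alphanumeric characters, drop titles, upper-case the rest."""
    stripped = re.sub(r"[^\w\s]|_", "", raw)
    kept = [tok for tok in stripped.split() if tok.lower() not in TITLES]
    return " ".join(kept).upper()


def parse_age(raw):
    """Return the age in years, or None when the value cannot be parsed."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", raw.strip().lower())
    if not match:
        return None
    value, unit = int(match.group(1)), match.group(2)
    if unit in YEAR_UNITS:
        return float(value)
    if unit in UNIT_IN_YEARS:
        return value * UNIT_IN_YEARS[unit]
    return None


def normalize_marital(raw):
    """Map English/French spellings onto one code; unknown values stay 'unknown'."""
    text = unicodedata.normalize("NFKD", raw.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return MARITAL.get(text, "unknown")
```

Stripping accents before the lookup lets "Mariée" and "mariee" fall into the same category, which is one simple way to handle the English/French notation problem.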

Computational Expense

• Very expensive to compare all the possible pairs of records

• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)

• Run-time estimate: (3.5M x 4M record pairs) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = 40.5 days per attribute compared, so about 81 days for 2 attributes
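The run-time arithmetic can be checked with a few lines of Python. The record counts and the comparison rate are the rounded figures from the slide; the function name is only illustrative. One comparison per record pair reproduces the 40.5-day figure, and two comparisons per pair double it.

```python
RECORDS_1871 = 3_500_000
RECORDS_1881 = 4_000_000
COMPARISONS_PER_SECOND = 4_000_000
SECONDS_PER_DAY = 60 * 60 * 24


def days_to_compare(attributes_per_pair):
    """Days needed to compare every 1871 record with every 1881 record."""
    comparisons = RECORDS_1871 * RECORDS_1881 * attributes_per_pair
    return comparisons / COMPARISONS_PER_SECOND / SECONDS_PER_DAY


print(round(days_to_compare(1), 1))  # 40.5
print(round(days_to_compare(2), 1))  # 81.0
```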

Managing Computational Expense

• Blocking

– By first letter of last name

– By birthplace

• Using HPC

– Running the system on multiple processors
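A minimal sketch of blocking, assuming records are dictionaries with hypothetical `last_name` and `birthplace` fields. It combines the two blocking criteria into a single key, which is an assumption; the slides do not say whether the criteria are applied together or separately.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(records_1871, records_1881):
    """Yield only the pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records_1881:
        blocks[block_key(rec)].append(rec)
    for rec_a in records_1871:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b
```

Because each block is compared independently of the others, blocks are also a natural unit of work to hand to separate processors.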

Record Comparison

• Comparing Strings

– Jaro-Winkler

– Edit Distance

– Double Metaphone

• Age

– +/- 2 years

• Exact matches

– Gender

– Birthplace
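The comparisons listed above could be turned into a similarity vector along these lines. This is a sketch under stated assumptions: the field names are hypothetical, the 10-year offset between the two censuses before applying the +/- 2 year window is an assumption, the Jaro-Winkler prefix weight of 0.1 is the conventional default rather than a value from the slides, and Double Metaphone is left out for brevity.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def edit_distance(s1, s2):
    """Levenshtein distance by row-by-row dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def compare(rec71, rec81, years_apart=10, age_tolerance=2):
    """Similarity vector for one candidate pair (field names are hypothetical)."""
    expected_age = rec71["age"] + years_apart
    return [
        jaro_winkler(rec71["last_name"], rec81["last_name"]),
        jaro_winkler(rec71["first_name"], rec81["first_name"]),
        1.0 if abs(rec81["age"] - expected_age) <= age_tolerance else 0.0,
        1.0 if rec71["gender"] == rec81["gender"] else 0.0,
        1.0 if rec71["birthplace"] == rec81["birthplace"] else 0.0,
    ]
```

A vector like this, one per candidate pair, is the kind of input a match/non-match classifier would consume.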

Classification

• Classifier

– Support Vector Machines

– 5-fold cross validation

• Training Data

– True links found by experts

– Ontario proprietors

• Classes

– Match

– Non-match
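The 5-fold cross-validation protocol can be sketched in plain Python. To keep the example free of dependencies, a simple threshold rule stands in for the Support Vector Machine; it is not the classifier used in the system, and the function names and toy similarity vectors are invented for illustration.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, train_fn, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = train_fn([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


def train_threshold_rule(X_train, y_train):
    """Stand-in for an SVM: cut midway between the two class means."""
    def mean(values):
        return sum(values) / len(values)
    pos = [mean(x) for x, label in zip(X_train, y_train) if label == 1]
    neg = [mean(x) for x, label in zip(X_train, y_train) if label == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if mean(x) >= cut else 0


# Toy similarity vectors: label 1 = match, label 0 = non-match.
X = [[0.90 + 0.01 * i, 1.0, 1.0] for i in range(10)] + \
    [[0.30 + 0.02 * i, 0.0, 1.0] for i in range(10)]
y = [1] * 10 + [0] * 10
scores = cross_validate(X, y, train_threshold_rule)
```

Any trainer that returns a prediction function can be passed as `train_fn`, so an SVM implementation could be substituted without changing the cross-validation loop.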

Linkage Results

Province Linkage Rate (%)

New Brunswick 24.45

Nova Scotia 21.50

Ontario 18.36

Quebec 17.45

Linkage Results - Evaluation

True Links Set Total TP (%) FP (%)

Ontario_Props 1647 21.59 9.28

Logan 1760 21.64 8.85

St_James 232 24.72 7.12

Les_Boys 1403 17.99 11.41
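One plausible way to compute such percentages against a true-links set is sketched below. The slides do not define TP (%) and FP (%), so the definitions here are assumptions: a true positive is a true link that the system reproduces exactly, and a false positive is a record from the true-links set that the system links to a different record. Both are expressed as a share of the true links.

```python
def evaluate(true_links, system_links):
    """Return (TP %, FP %) relative to the number of true links.

    Both arguments map an 1871 record id to an 1881 record id.
    The TP/FP definitions are assumptions, not taken from the slides.
    """
    total = len(true_links)
    tp = sum(1 for a, b in true_links.items() if system_links.get(a) == b)
    fp = sum(1 for a, b in true_links.items()
             if a in system_links and system_links[a] != b)
    return 100 * tp / total, 100 * fp / total
```

Under these definitions, true links that the system leaves unlinked count as neither TP nor FP, which would explain why the two percentages in a row do not sum to 100.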

Province TP FP Possible Unsure

New Brunswick 66 27 6 1

Nova Scotia 70 22 5 -

Ontario 53 40 5 2

Quebec 42 52 6 -

Linkage Results - Evaluation

Attribute ON71 QC71 CAN81 ON_Props Linked(ON) Linked(QC)

Gender Distribution

Female 47.46 49.83 49.35 48.63 45.26 43.50

Male 49.69 50.00 50.64 51.33 54.74 56.50

Age

0-15 42.20 41.84 38.68 60.28 40.96 43.24

15-25 20.12 20.72 21.22 9.44 20.70 22.56

25-50 26.42 25.78 27.68 31.35 26.95 23.07

>50 11.26 11.66 12.42 8.93 11.39 11.13

Birthplace

ON (15030) 67.29 0.57 34.04 73.24 66.30 0.48

QC (15081) 2.45 91.71 30.70 2.40 2.57 92.08

ENG (41000) 7.44 1.11 4.02 6.74 10.00 1.37

IRE (41100) 5.48 0.98 2.75 5.84 5.40 0.94

SCO (41400) 9.35 3.17 4.45 7.33 8.57 2.83

GER (45300) 1.23 0.06 0.56 1.12 2.10 0.07

USA (9900) 2.59 1.23 1.77 2.19 3.96 1.72

Marital Status

Married (1) 30.36 30.22 31.78 39.75 29.11 23.13

Widowed (5) 3.21 3.02 3.66 0.86 4.07 3.64

Single (6) 66.43 66.75 64.52 59.39 66.82 73.24

Directions to Improve

• Common patterns in incorrect links

– Big age difference

– Change in marital status for females

– First name change

• Probability estimate score of the classifier
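Both directions can be expressed as post-filters on the classifier's output. The sketch below is illustrative only: the link layout, the field names, the 10-year census gap and the age tolerance are assumptions, and the rules are a literal reading of the three patterns listed above rather than the system's actual filter.

```python
def is_suspicious(link, years_apart=10, max_age_gap=2):
    """Flag a link showing one of the patterns common in incorrect links."""
    r71, r81 = link["rec71"], link["rec81"]
    if abs(r81["age"] - (r71["age"] + years_apart)) > max_age_gap:
        return True  # big age difference
    if r71["gender"] == "F" and r71["marital"] != r81["marital"]:
        return True  # change in marital status for a female
    if r71["first_name"] != r81["first_name"]:
        return True  # first name change
    return False


def drop_suspicious(links):
    """Keep only links that show none of the flagged patterns."""
    return [link for link in links if not is_suspicious(link)]


def keep_confident(links, min_score=0.8):
    """Keep only links whose classifier probability estimate reaches min_score."""
    return [link for link in links if link["score"] >= min_score]
```

The two filters are independent, so they can be applied separately or chained; raising `min_score` trades fewer links for fewer false positives.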

Results – Common Patterns

Before

Province Linkage Rate (%)

New Brunswick 24.45

Nova Scotia 21.50

Ontario 18.36

Quebec 17.45

After

Province Linkage Rate (%) Diff.

NB 22.24 -2.21

NS 18.72 -2.78

ON 15.68 -2.68

QC 14.82 -2.63

Results – Common Patterns

Before

True Links Set Total TP (%) FP (%)

Ontario_Props 1647 21.59 9.28

Logan 1760 21.64 8.85

St_James 232 24.72 7.12

Les_Boys 1403 17.99 11.41

After

Set TP (%) TP Diff. FP (%) FP Diff.

O_P 20.48 -1.11 7.32 -1.96

L 20.36 -1.28 7.25 -1.6

St_J 23 -1.72 5.92 -1.2

L_B 16.66 -1.33 10.36 -1.05

Results – Classification Scores

Classifier score 0.8

True Links Set Total TP (%) FP (%)

Logan 1760 19.37 4.86

St_James 232 22.06 3.43

Les_Boys 1403 15.25 5.94

Classifier score 0.85

True Links Set Total TP (%) FP (%)

Logan 1760 18.97 4.61

St_James 232 22.06 3

Les_Boys 1403 14.64 5.31

Classifier score 0.9

True Links Set Total TP (%) FP (%)

Logan 1760 18.125 3.78

St_James 232 21.63 2.4

Les_Boys 1403 13.94 3.97

Conclusions

• Linking people across 1871-1881 Canadian censuses

• Preliminary automated linkage system

• More evaluation and experimentation are needed

Acknowledgements

• University of Guelph

• Ontario Ministry of Research and Innovation

• SHARCNET

• FamilySearch, Church of Latter Day Saints

• Minnesota Population Center

• University of Alberta

• Université de Montréal/PRDH

• Université Laval/CIEQ
