An Identifier Scheme for the Digitising Scotland Project
Alasdair J G Gray
Department of Computer Science,
Heriot-Watt University, Edinburgh
@gray_alasdair
www.macs.hw.ac.uk/~ajg33
Özgür Akgün, Uni. of St Andrews
Ahmad Alsadeeqi, Heriot-Watt Uni.
Peter Christen, Australian National Uni.
Tom Dalton, Uni. of St Andrews
Alan Dearle, Uni. of St Andrews
Chris Dibben, Uni. of Edinburgh
Eilidh Garret, Uni. of Essex
Graham Kirby, Uni. of St Andrews
Alice Reid, Uni. of Cambridge
Lee Williamson, Uni. of Edinburgh
Digitising Scotland Project
Large-scale family reconstruction studies and pedigrees
• Transcription of data
• Linking of data
Performed at scale
• Whole nation
• Large timeframe
1 June 2017 ADRN Conference 2
Project Team
Backgrounds
• Demographers • Historians • Computer Scientists
Distributed team
• St Andrews • Cambridge • Edinburgh • Australia
Transcribing Scotland’s Vital Records: 1855 – 1974
• 24M records
  • Birth
  • Marriage
  • Death
• 18M individuals
Data Linkage Challenges
• Low quality data
• Probabilistic matches
• Scalability
• Skewed name distributions
[Figure: example records with near-matching names and dates]
Record A (Married: 1884): John Grant (Fisherman); Fiona Sinclair; Ian Grant (Smithy, Born: 1861); Stuart Adam (Wheelwright); Morag Scott; Flora Adam (Seamstress, Born: 1866)
Record B: John Grant (Farmer); Fiona Sinclaire; Iain Grant (Born: 1860)
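The name variants in the example records ("Sinclair" / "Sinclaire", "Ian" / "Iain") show why exact string comparison is not enough. The sketch below is only an illustration using Python's standard `difflib`; it is not the project's linkage method, and the `similarity` helper is a hypothetical name introduced here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Name pairs taken from the example records on this slide.
pairs = [
    ("Fiona Sinclair", "Fiona Sinclaire"),
    ("Ian Grant", "Iain Grant"),
    ("John Grant", "John Grant"),
    ("Ian Grant", "Stuart Adam"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

A score close to 1 suggests a likely match and a low score a likely non-match. Any cut-off between the two would be a judgment call, which is what makes the matches probabilistic.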
Linking Skye Data
Discussing records
Eilidh, I’m having problems with the Skye record B-BABY-8293.
Peter, which transcribed certificate is that?
It is the record for Chris Dibben, born 18 March 1893.
That is the child on record 5457. It should link to the death on record 5754, 4 December 1959.
Thanks, found it now. It is record D-DEATH-2182.
Existing Identifier Schemes
Historians: Example: 5457
• Incremental integer
• Easily confused with other record types
• Identifies certificate not actors
• Based on order of transcription
• Not derived from data
• Unique for a file
• Excel spreadsheet
Record Linkage: Example: B-BABY-8293
• Encode type of certificate and actor on certificate
• Four digits generated by linkage process
• Different from those used by the historians
• Different for each run of linkage pre-processing
Desiderata for Identifiers
1. Identifier for each actor on a certificate
2. Exchangeable between researchers
3. Unique generation process from the data
4. Immutable to data changes, e.g. typo discovered in data
5. Human derivable from data records
6. Human interpretable
7. Compact to enable efficient computation
8. Susceptible to blocking
9. Globally unique
10. Consistent approach for all record types
11. Compatible with pre-existing NRS approach
12. Compatibility with Open Data Standards
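Desideratum 8 can be read as: a cheap key should be readable straight off the identifier, so that only records sharing that key are compared. The sketch below illustrates the idea under loud assumptions: the identifiers are invented, and taking the text before the first underscore as the blocking key is a hypothetical choice, not the project's.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(identifier: str) -> str:
    """Use the text before the first underscore as the block key (an assumption)."""
    return identifier.split("_", 1)[0]


def candidate_pairs(identifiers):
    """Yield only pairs of identifiers that share a blocking key."""
    blocks = defaultdict(list)
    for identifier in identifiers:
        blocks[blocking_key(identifier)].append(identifier)
    for members in blocks.values():
        yield from combinations(sorted(members), 2)


# Hypothetical identifiers, invented for illustration only.
ids = ["B1893_a", "B1893_b", "B1893_c", "D1959_a", "D1959_b"]
pairs = list(candidate_pairs(ids))
print(len(pairs), "candidate pairs instead of", len(ids) * (len(ids) - 1) // 2)
```

The key used here is purely illustrative. A real linkage run would pick a key suited to the kind of link being sought, since records in different blocks are never compared.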
Identifier Scheme
B1903_164_00_baby
typeYear_district_subdistrict_entryNumber_role
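A minimal sketch of how an identifier following the template typeYear_district_subdistrict_entryNumber_role could be built and parsed. It assumes all five fields are present, that fields other than the role contain no underscores, and that the type is a single capital letter. The class name `ActorId`, the type code "M" and the entry number "0123" are hypothetical values chosen for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: fields are separated by underscores and the role comes last,
# so the role itself may contain underscores (e.g. "groom_father").
_PATTERN = re.compile(
    r"^(?P<type>[A-Z])(?P<year>\d{4})_(?P<district>[^_]+)_"
    r"(?P<subdistrict>[^_]+)_(?P<entry>[^_]+)_(?P<role>.+)$"
)


@dataclass(frozen=True)
class ActorId:
    cert_type: str
    year: int
    district: str
    subdistrict: str
    entry_number: str
    role: str

    def __str__(self) -> str:
        return (f"{self.cert_type}{self.year}_{self.district}_"
                f"{self.subdistrict}_{self.entry_number}_{self.role}")

    @classmethod
    def parse(cls, text: str) -> "ActorId":
        match = _PATTERN.match(text)
        if match is None:
            raise ValueError(f"not a five-field identifier: {text!r}")
        return cls(match["type"], int(match["year"]), match["district"],
                   match["subdistrict"], match["entry"], match["role"])


# The type code "M" and entry number "0123" are made-up illustration values.
example = ActorId("M", 1884, "164", "00", "0123", "groom_father")
print(example)
print(ActorId.parse(str(example)) == example)
```

Keeping the district, subdistrict and entry number as strings avoids guessing at padding rules, which the slides do not state.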
Certificate Roles
Birth
• baby • mother • father • registrar • informant
Marriage
• groom • groom_father • groom_mother • bride • bride_father • bride_mother • witness1 • witness2 • officiant • registrar
Death
• deceased • mother • father • spouse1…spousen • informant • doctor • registrar
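The role lists above can be treated as a small lookup table for validating the role part of an identifier. This is a sketch only: the role names are copied from the slide, but the single-letter type codes "M" and "D" and the `is_valid_role` helper are assumptions, since only "B" appears in the example identifier.

```python
import re

# Role names copied from the slide; the type codes "M" and "D" are assumed.
ROLES = {
    "B": {"baby", "mother", "father", "registrar", "informant"},
    "M": {"groom", "groom_father", "groom_mother", "bride", "bride_father",
          "bride_mother", "witness1", "witness2", "officiant", "registrar"},
    "D": {"deceased", "mother", "father", "informant", "doctor", "registrar"},
}

# spouse1 ... spousen: a death certificate may list any number of spouses.
_SPOUSE = re.compile(r"^spouse[1-9]\d*$")


def is_valid_role(cert_type: str, role: str) -> bool:
    """Check whether a role is allowed on the given certificate type."""
    if cert_type == "D" and _SPOUSE.match(role):
        return True
    return role in ROLES.get(cert_type, set())


print(is_valid_role("B", "baby"))
print(is_valid_role("D", "spouse2"))
print(is_valid_role("B", "groom"))
```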
Conclusions
• Agreed identifier scheme: typeYear_district_subdistrict_entryNumber_role
• Meets desiderata
• Reliant on “clean” parts of certificate
• Compatible with NRS
• Improved team communications
Alasdair Gray
www.macs.hw.ac.uk/~ajg33/
@gray_alasdair
Acknowledgements:
• Julia Jennings
• Christine Jones
• Diego Ramiro-Farinas