Thomas Hickey
Chief Scientist, OCLC Research
2016 August
Authority Data on the Web
VIAF Reflections
Personal names
Geographic
Corporate
Title
Family
Events
Everything but concepts is considered in scope
National level, but willing to consider other sources
Scope of VIAF
(2009)
Started with 2 files
• Ed O’Neill’s group gave us DDB & LC matching
• First project: replicate their matching of DDB/LC
• Second: extended to 3 files with BnF
– UNIMARC
– UNICODE
Modest beginnings
• First interface retrieved clusters
– A cluster was just a set of 1-3 source IDs
– Showed preferred form for each
• No linked data
• Hardly any merged information
• But enough to be useful
Matching
• Pairwise matching
– Still how we do it
– Pairwise links used for evaluation as well
• Map-reduce happened
– Method of distributing processing
– Obtained a cluster
• Rocks followed by Pebbles and now Gravel
• 400+ CPUs, terabyte of memory, petabytes of disk
– M-R used for much of OCLC’s processing
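One way to picture how pairwise links become clusters is a transitive closure over the links. The sketch below uses union-find for that. It is an illustration only: the slides do not describe VIAF's actual clustering rules, the source IDs are hypothetical, and the function name `cluster_from_links` is an assumption.

```python
from collections import defaultdict


def cluster_from_links(links):
    """Group source IDs into clusters by taking the transitive closure
    of pairwise links (union-find). Illustrative only, not VIAF's method."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical source IDs; real VIAF identifiers look different.
links = [("LC:n1", "DNB:1"), ("DNB:1", "BNF:9"), ("LC:n2", "BNF:7")]
clusters = cluster_from_links(links)
```

A plain transitive closure merges everything reachable through any link, so a single bad pairwise match can join two unrelated clusters.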
VIAF
DNB Bib & Authority BnF Bib & Authority LC Bib & Authority
VIAF
~7.5 million personal name authority records
~25 million bibliographic records
~1.2 million links between files
(2008)
Current size of VIAF
• 44 active participants/files
• 55 million source authority records
• 130 million bibliographic records
• 256 million links between sources
• 30 million external links
• 33 million VIAF clusters
Sources
• Still concentrating on national libraries and national/international consortia
• But we have
– Getty ULAN
– Wikipedia (Wikidata)
– Perseus
– Syriac
– xR
Communautés européennes. Cour de justice. Division Bibliothèque
http://viaf.org/viaf/127884087
Various types of dates
藤原, 長清, 永仁頃 (Reign)
.هـ1111-1037لمجلسي، محمد باقر بن محمد تقي، (Hijri)
Gregorian, pre-Gregorian
http://journal.code4lib.org/articles/9607
Various dates
Joan, Clímac, sant, s. VI
Joannes, Climacus, 6e/7e E.
Jean Climaque, saint, 0579?-0649?
Jan III (papież ; -574)
John, Climacus, Saint, 6th cent.
Jean III pape 05..-0574
Johannes Klimakos, helgon, 500-talet
Jan Klimakos, svatý, asi 579-asi 649
More date variations
Suetonius, approximately 69-approximately 122
Suetonius Tranquillus, Caius, ca. 69/70-ca. 140
Suetónio, fl. 69-141
Suetonius Tranquillus, Gaius, asi 69-140
Suetonius Tranquillus, Gaius, 69-
Svetonijs, apm. 69-apm. 122
Suetonius Tranquillus, Caius, f. sec. I-II
Suetonius Tranquillus, Caius (ca 70-ca 140)
Suetoni, ca. 69-140
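To show why these variations make date comparison awkward, here is a minimal sketch that pulls explicit years out of a heading and checks whether two headings roughly agree. It is not VIAF's date handling; the regular expression, the function names, and the one-year tolerance are all assumptions made for illustration. Century forms ("6th cent.", "f. sec. I-II") and partial years ("05..") are deliberately ignored, and a death-only form such as "-574" would be misread as the earliest year.

```python
import re

# 2-4 digit runs that are not part of a longer word or number and are not
# a truncated year such as "05.." (assumed pattern, not from the slides).
YEAR = re.compile(r"(?<!\w)(\d{2,4})(?!\w|\.\.)")


def extract_years(heading):
    """Return every explicit year found in a heading, in order."""
    return [int(y) for y in YEAR.findall(heading)]


def earliest_years_agree(a, b, tolerance=1):
    """Compare the earliest explicit year in two headings.

    Returns None when either heading has no explicit year.
    """
    years_a, years_b = extract_years(a), extract_years(b)
    if not years_a or not years_b:
        return None
    return abs(min(years_a) - min(years_b)) <= tolerance


agree = earliest_years_agree(
    "Suetonius, approximately 69-approximately 122",
    "Suetonius Tranquillus, Caius (ca 70-ca 140)",
)
```

Returning `None` for headings without explicit years keeps "no evidence" separate from "dates disagree".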
Variant forms for FRBR processing
mcshann, jay
mcshann, jay.leader
mcshann, jay.1909 2006
mcshann, jay.1916 2006
wang, xinlian.1960
王新莲
王新莲.1960
王新莲.singer
de lucia, pepe.leader
lucia, pepe de
pepe de lucia
pepe de lucia.1945
christopher, r
christopher, russel
christopher, Russell
christopher, russell.1930
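The variant forms above appear to follow a pattern: a case-folded name, optionally followed by "." and either dates or a role. The sketch below reproduces that pattern as inferred from the examples. How the keys are actually generated is not described in the slides, so the function `variant_keys` and its parameters are hypothetical.

```python
import unicodedata


def variant_keys(name, dates=None, roles=()):
    """Build lookup keys shaped like the examples above:
    the bare name, name.dates, and name.role (inferred format)."""
    base = unicodedata.normalize("NFC", name).casefold().strip()
    keys = [base]
    if dates:
        # Dates are joined with a space, as in "mcshann, jay.1909 2006".
        keys.append(f"{base}.{' '.join(str(d) for d in dates)}")
    keys.extend(f"{base}.{role.casefold()}" for role in roles)
    return keys


mcshann = variant_keys("McShann, Jay", dates=(1909, 2006), roles=("leader",))
wang = variant_keys("王新莲", dates=(1960,), roles=("singer",))
```

Emitting several keys per name lets a record with only a name still meet a record that also carries dates or a role, at the cost of more candidate matches to check.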
Finding Works and Expressions
[Diagram labels: Production; enhanced WorldCat; previous FRBR clusters; VIAF; generated authorities (xR); Works & Expressions; Meta Authorities; Full Encoding; Series; Overrides; FAST; GLIMIR; Aud Level; LCSH, Genres, MeSH; Work Records]
What we did right
• National library participation is critical
• Minimal changes to source data
– Use original IDs
– Original MARC tagging when available
– Flexible on source format, harvest
• Multiple interface languages
• Minimize use of name text for matching
• Linked open data, bulk availability
• Used within OCLC
Other options?
• Stick to the idea of ‘virtual’
– No VIAF identifier
• Use MARC internally for the clusters
– Currently an ad-hoc XML
– Extensions to MARC-21 make it easier
• More JSON, less XML?
• Avoid immature files
• More mathematical approach to clustering
• Avoid ‘|’ in our internal IDs
• Stricter matching
Overall
• VIAF has been a remarkable success
– Support of participating libraries
– Support of OCLC
– Strong demand
– Emphasis on linked data
• It’s been a privilege to work on it!