FRBR: Algorithms and Applications. T. Hickey, J. Toves, D. Vizine-Goetz. Online Computer Library Center. CLA, November 2004

FRBR: Algorithms and Applications

T. Hickey, J. Toves, D. Vizine-Goetz

Online Computer Library Center

CLA November 2004

Outline

Algorithms
• FRBR work matching
• Handling author-title variants

Hardware
• Beowulf cluster

Applications
• Bookmarklets
• FictionFinder

Future directions

Working with Group 1 Entities

WEMI: Work, Expression, Manifestation, Item

Strict expression-level determination is hard
• We primarily divide by language

Manifestation is easier
• We use the WorldCat master record

Work Identification

Algorithm goals:
• Efficient
• Understandable
• Controllable by catalogers
• Uses existing WorldCat records

The Algorithm

A key is generated for each record

Extract author, title
• Look up in LC name authority file
• Added entry information as needed

Form a key from bibliographic record
• Author, title, added entry information
• These can be sorted, compared
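
The idea of sortable, comparable keys can be sketched roughly as follows. This is a minimal illustration, not OCLC's code: the `make_key` function, the `author/title` key layout, and the sample records are all assumptions made for the example.

```python
from collections import Counter

def make_key(author, title):
    """Build a comparable key: lowercase, collapse whitespace, join with '/'."""
    def norm(text):
        return " ".join(text.lower().split())
    return f"{norm(author)}/{norm(title)}"

# Illustrative records only (author, title); not taken from WorldCat.
records = [
    ("Smollett, Tobias", "The Expedition of Humphry Clinker"),
    ("smollett, tobias", "The  expedition of Humphry Clinker"),
    ("Smollett, Tobias", "Humphry Clinker"),
]

# Records with the same key fall into the same candidate work.
key_counts = Counter(make_key(author, title) for author, title in records)
for key, count in key_counts.most_common():
    print(count, key)
```

Because the keys are plain strings, grouping records into works reduces to sorting or counting identical strings.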

Example

146 Smollett\1721 Expedition of Humphry Clinker

16 Smollett\1721 Expedition of Humphrey Clinker

8 Smollett\1721 Humphry Clinker

4 Smollett\1721 Humphrey Clinker

2 Smollett\1721 Expedition of Humphry Clinker

1 Smollett\1721 Calatoriile lui Humphrey Clinker

1 Smollet\1721 Expedition of Humphry Clinker

1 Smollett Humphry Klinkers Reisen

Example (with authorities)

156 Smollett\1721 Expedition of Humphry Clinker

16 Smollett\1721 Expedition of Humphrey Clinker

4 Smollett\1721 Humphrey Clinker

1 Smollett\1721 Calatoriile lui Humphrey Clinker

1 Smollet\1721 Expedition of Humphry Clinker

1 Smollett\1721 Humphry Klinkers Reisen

More Detail

Extract author names
• Look up in authority file
• Currently only personal names
• Subfields $abcdq

Extract title
• Always use uniform titles if present
• Look up author/short title (~$a)
• Look up author/long title (~$abfgnp)
• Prefer alternative title for non-English

Create key from author/title
• Always do NACO normalization (has limitations)
• Add information for uncontrolled title-main-entry
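
A rough sketch of how normalization and authority lookup might combine into a key. `simple_normalize` is only a stand-in for NACO normalization (the real rules are more detailed), and the two lookup tables are hypothetical, modeled on the Humphry Clinker example.

```python
import re
import unicodedata

def simple_normalize(text):
    """Simplified stand-in for NACO normalization (NOT the full rule set)."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep letters, digits, commas and ampersands; everything else becomes a space.
    cleaned = re.sub(r"[^a-z0-9,&]+", " ", no_marks.lower())
    return " ".join(cleaned.split())

# Hypothetical authority data: variant form -> established form.
AUTHOR_AUTHORITY = {"smollett": "smollett\\1721"}
TITLE_AUTHORITY = {
    ("smollett\\1721", "humphry clinker"): "expedition of humphry clinker",
}

def work_key(author, title, uniform_title=None):
    """Prefer the uniform title, then apply author and author/title lookups."""
    author_key = simple_normalize(author)
    author_key = AUTHOR_AUTHORITY.get(author_key, author_key)
    title_key = simple_normalize(uniform_title or title)
    title_key = TITLE_AUTHORITY.get((author_key, title_key), title_key)
    return f"{author_key}/{title_key}"

print(work_key("Smollett", "Humphry Clinker"))
```

With tables like these, records cataloged under the short title end up with the same key as records under the full title, which is the effect shown in the "with authorities" example.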

Authority Files Rule!

Authors
Author/titles

Bring together variations

Allow override in difficult cases
• Both splitting and joining groups
• Especially important with xISBN matching

Especially important with non-English metadata

Limitations of the Authority File

What’s missing:
• Many uniform titles
• Many author variants
• Many title variants
• Language of heading

Partial solution
• Create auxiliary files of mechanically generated matches

Results of FRBR Matching on WorldCat

88% of manifestations are ‘singletons’
30% of manifestations are in 12% of the works
Average size of multiple matches: 3.1 manifestations/work
43.1 million works in 54 million manifestations
54% of holdings on a FRBR work with >1 manifestation
WorldCat manifestations average about 20 holdings

FRBR helps where help is most needed

More FRBR Results

310,000 works have more than 5 manifestations
1.7 million have more than 2 manifestations

Largest:
• 30,000+ for the Bible
• 1,537 Shakespeare’s Macbeth
• 1,026 Dickens’s Christmas Carol

The Top 10 Works by Holdings

Rank  Work                                       Holdings  Manifestations
1     US Census (various)                         403,252          10,164
2     Bible (combined)                            271,534          36,738
3     Mother Goose                                 66,543           1,997
4     Dante, The Divine Comedy                     59,034           2,714
5     Homer, The Odyssey                           43,871           2,009
6     Homer, The Iliad                             42,756           2,388
7     Twain, Huckleberry Finn                      39,310           1,093
8     Shakespeare, Hamlet                          37,683           1,917
9     Carroll, Alice’s Adventures in Wonderland    37,614           1,865
10    Tolkien, Lord of the Rings                   37,461             643

The Top 10 Works Cataloged in 2003

Rank  Work                                                 Libraries
1     Rowling, Harry Potter and the Order of the Phoenix       2,406
2     Clinton, Living History                                 36,738
3     Rohmann, My Friend Rabbit                                1,997
4     Brown, The Da Vinci Code                                 2,714
5     Gibaldi, MLA Handbook                                    2,009

Top 1000 Publication Dates

Top 1000 Languages

Our Beowulf Cluster

24 Nodes
• Each with 2x2.6 GHz processors
• 4 GBytes memory (96 GBytes total)

One ‘head’ node, 23 ‘compute’ nodes

46x40 GBytes disk (~2 Terabytes total)

Gigabit switch

What we are using it for

All our bibliographic processing
• FRBR
• Extractions
• Searching
• Matching

Ganglia load visualization

Starting point

FRBR key generation

25 hours on a 3.00GHz workstation with 2GB of RAM

Generate two key files
• sort by key, uniq by key, sort by occurrence
• sort by key, post processing on keys, uniq by key, sort by occurrence

Merge key files

FRBR on the Cluster

44 minutes on the cluster

69 key builders & 23 sort buckets with hyperthreading ON

Generate 23 radix-sorted, post-processed key files

Collapse and sort by occurrence in parallel

Also outputs additional files used by other jobs
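
The bucket-then-collapse pattern might look roughly like the sketch below. It is an illustration under stated assumptions: it assigns buckets with a CRC hash rather than the radix split named on the slide, it runs the buckets one after another instead of on separate nodes, and all function names are invented for the example.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 23  # matches the 23 sort buckets mentioned on the slide

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Stable bucket assignment so identical keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def partition(keys, num_buckets=NUM_BUCKETS):
    """Distribute keys over buckets; each bucket can be processed independently."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bucket_of(key, num_buckets)].append(key)
    return buckets

def collapse(bucket):
    """Equivalent of 'sort | uniq -c' within a single bucket."""
    return Counter(bucket)

def frbr_counts(keys):
    """Collapse every bucket, merge the counts, and order by occurrence."""
    merged = Counter()
    for bucket in partition(keys):
        merged.update(collapse(bucket))  # no key spans two buckets
    return merged.most_common()

print(frbr_counts(["a/x", "b/y", "a/x", "a/x", "b/y", "c/z"]))
```

Since no key can appear in two buckets, the per-bucket work needs no coordination. For scale, going from 25 hours to 44 minutes is roughly a 34-fold speedup (1,500 minutes / 44 minutes).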

Application: Preservation

Identify ‘final copy’ items
Do it at the work level

Single-singles
• Single manifestations with single holding
• Found 18 million in WorldCat
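
One way to pick out single-singles once records carry a work key is sketched here. The record layout `(work_key, manifestation_id, holdings_count)` and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (work_key, manifestation_id, holdings_count)
records = [
    ("author a/title one", "m1", 1),
    ("author a/title one", "m2", 12),
    ("author b/title two", "m3", 1),
    ("author c/title three", "m4", 7),
]

def single_singles(records):
    """Return (work, manifestation) pairs for works that have exactly one
    manifestation, where that manifestation has exactly one holding."""
    by_work = defaultdict(list)
    for work, manifestation, holdings in records:
        by_work[work].append((manifestation, holdings))
    return [
        (work, manifestations[0][0])
        for work, manifestations in by_work.items()
        if len(manifestations) == 1 and manifestations[0][1] == 1
    ]

print(single_singles(records))
```

Note that "m1" is not reported even though it has a single holding: its work has another manifestation, so the check is made at the work level.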

Application: xISBN

A simple Web service

Given an ISBN:
• Identify the workset it is in
• Return all other ISBNs in that workset

Results should be symmetrical!
• Same group retrieved for each ISBN in group

ISBNs sorted by number of library holdings
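
The lookup behind such a service could be sketched as an index from each ISBN to its whole group. This is an assumption-laden illustration: `build_index` and `xisbn` are invented names, the holdings counts are made up, and each ISBN is assumed to belong to exactly one workset.

```python
def build_index(worksets):
    """worksets: list of dicts mapping ISBN -> holdings count.
    Every ISBN in a workset maps to the same list, ordered by holdings."""
    index = {}
    for workset in worksets:
        ordered = sorted(workset, key=lambda isbn: (-workset[isbn], isbn))
        for isbn in workset:
            index[isbn] = ordered
    return index

def xisbn(index, isbn):
    """Look up an ISBN (hyphens ignored) and return its whole group."""
    return index.get(isbn.replace("-", ""), [])

# Made-up holdings counts, for illustration only.
index = build_index([
    {"0192816640": 50, "0820312037": 20, "0140430210": 35},
    {"1111111111": 3},
])
print(xisbn(index, "0-19-281664-0"))
```

Storing one shared list per workset gives the symmetry property directly: any member of a group retrieves the same list in the same order.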

xISBN Example

http://labs.oclc.org/xisbn/0-19-281664-0 returns:

<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
  <isbn>0393015920</isbn>
  <isbn>0393952274</isbn>
  <isbn>0393952835</isbn>
  <isbn>0140430210</isbn>
  <isbn>0192811320</isbn>
  <isbn>0192835947</isbn>
  <isbn>0460872885</isbn>
  <isbn>1853262706</isbn>
  <isbn>0874131219</isbn>
</idlist>
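
A client could read a response of this shape with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on an abbreviated copy of the response; the `parse_idlist` helper is an invented name, and no network call is made.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response shown on the slide.
response = """<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
</idlist>"""

def parse_idlist(xml_text):
    """Return the ISBNs from an <idlist> document, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [element.text for element in root.iter("isbn")]

print(parse_idlist(response))
```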

Matching on ISBNs

ISBN: additional information beyond author/title
• Allows relaxation of matching
• Introduces possible errors

Offers the possibility of substantial improvement of work matching

Merging Worksets Using ISBN Matches

Pair ISBNs with FRBR keys (starts with 10 million ISBNs)
Throw out ISBNs in single worksets
Throw out ISBNs in > 5 worksets (we now have 561,000 ISBNs left)
Are the titles similar enough?
Throw out large groups
Try to be very conservative
Authority file always overrides other matching
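
The filtering steps above can be sketched as follows. This is a hedged illustration, not the production logic: the similarity test uses `difflib.SequenceMatcher` with an assumed threshold of 0.8 (the slide gives no number or method), keys are assumed to look like `author/title`, and the "large groups" and authority-override steps are not modeled.

```python
from collections import defaultdict
from difflib import SequenceMatcher

MAX_WORKSETS = 5      # from the slide: drop ISBNs found in more than 5 worksets
MIN_SIMILARITY = 0.8  # assumed threshold; not given in the source

def titles_similar(key_a, key_b):
    """Compare only the title part of two 'author/title' keys."""
    title_a = key_a.split("/", 1)[-1]
    title_b = key_b.split("/", 1)[-1]
    return SequenceMatcher(None, title_a, title_b).ratio() >= MIN_SIMILARITY

def candidate_merges(pairs):
    """pairs: iterable of (isbn, frbr_key). Returns key pairs worth merging."""
    keys_by_isbn = defaultdict(set)
    for isbn, key in pairs:
        keys_by_isbn[isbn].add(key)

    merges = set()
    for keys in keys_by_isbn.values():
        # One workset: nothing to merge. Too many: the ISBN is suspect.
        if len(keys) < 2 or len(keys) > MAX_WORKSETS:
            continue
        ordered = sorted(keys)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                if titles_similar(first, second):
                    merges.add((first, second))
    return sorted(merges)

pairs = [
    ("1", "smollett\\1721/expedition of humphry clinker"),
    ("1", "smollett\\1721/expedition of humphrey clinker"),
    ("2", "dickens, charles\\1812 1870/hard times"),
    ("2", "dickens, charles\\1812 1870/bleak house"),
    ("3", "x/only one"),
]
print(candidate_merges(pairs))
```

In this sample the two Smollett spellings are proposed as a merge, while the two Dickens titles that share an ISBN are rejected because their titles differ too much.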

Matches from ISBN Matching

74,000 author variants
~200,000 title variants

These all create additional cross reference records
Automatically folded into FRBR matching
Kept separate from NACO file
• Only used in research at this time

Examples of Possible Matches

/mcgraw hill encyclopedia of science & technology
/mcgraw hill encyclopedia of science & technology\1\aar aor
/mcgraw hill encyclopedia of science & technology\2\apa boo
/mcgraw hill encyclopedia of science & technology\3\bor cle
/mcgraw hill encyclopedia of science & technology\4\cli cyt
…

dickens, charles\1812 1870/tale of two cities
dickens, charles\1812 1870/hard times
dickens, charles\1812 1870/sketches by boz
dickens, charles\1812 1870/martin chuzzlewit
dickens, charles\1812 1870/bleak house
dickens, charles\1812 1870/little dorrit
dickens, charles\1812 1870/oliver twist
…

Application: Bookmarklets

Clicking on Princeton

FictionFinder

Indexes fiction from WorldCat
Uses FRBR workset algorithm
Focused on fiction

Searching and browsing by
• Genre
• Fictitious Characters
• Imaginary Places
• Literary Forms

Links to
• Google
• Open WorldCat

Diane Vizine-Goetz’s project

‘Humphry Clinker’ Search

Work Display

Detail of Language Display

First Few English Manifestations

Manifestation Display

Open WorldCat Link

Additional Matches

Match variant titles:
• When the wind blows
• When the wind blows: a novel

FictionFinder identified 10,000 similar variations
• novela, novella, roman, …

Created auxiliary authority records
Now automatically used when FRBR algorithm is run
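
Variants of this kind could be detected mechanically with a pattern like the one below. The suffix words come from the slide; the regular expression and the `base_title` helper are assumptions for illustration, not the rule FictionFinder actually used.

```python
import re

# Suffix words taken from the slide; the pattern itself is an assumption.
GENERIC_SUFFIX = re.compile(
    r"\s*:\s*(a\s+)?(novel|novela|novella|roman)\s*$",
    re.IGNORECASE,
)

def base_title(title):
    """Strip a trailing generic form designation such as ': a novel'."""
    return GENERIC_SUFFIX.sub("", title).strip()

variants = ["When the wind blows", "When the wind blows: a novel"]
print({base_title(title) for title in variants})
```

Each title whose stripped form differs from the original is a candidate for an auxiliary cross reference from the longer form to the shorter one.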

Future

Continued development of FictionFinder
Extending algorithm to serials?
FirstSearch displays
Additional matching criteria
Local authority files?
Integration of auxiliary files for production?
Exploring FRBRizing some European catalogs
Looking at extending beyond Roman characters

Links

IFLA FRBR - Final Report
• http://www.ifla.org/VII/s13/frbr/frbr.htm

Article in D-Lib
• http://www.dlib.org/dlib/september02/hickey/09hickey.html

OCLC Research Activities with FRBR
• http://www.oclc.org/research/projects/frbr/

FictionFinder
• http://fictionfinder.oclc.org/

Top 1000
• http://www.oclc.org/research/top1000/