Society of American Archivists Research Forum 18 August 2015 A Deep Dive into the Archival MARC Records in WorldCat (and ArchiveGrid) Jackie Dooley Program Officer OCLC Research

Society of American Archivists Research Forum, 18 August 2015

A Deep Dive into the Archival MARC Records in WorldCat (and ArchiveGrid)

Jackie Dooley

Program Officer

OCLC Research

OVERVIEW

• Research objective

• Research questions

• The data set

• High-level findings

• Next steps

RESEARCH OBJECTIVE

Research Objective

Establish a detailed profile of MARC data element occurrences in archival catalog records, providing a view of 30+ years of practice.

• Reveal variations in descriptive practice

• Debunk inaccurate assumptions

• Characterize before MARC usage diminishes

• Suggest improvements in descriptive practice

• Enable analysis of implications for discovery

SAMPLE RESEARCH QUESTIONS

Sample research questions

• Are descriptions and index terms rich enough to enable effective discovery of archival materials?

• In what significant ways does archival description differ from one type of material to another?

• To what extent does use of the archival control byte successfully capture the universe of archival descriptions?

• Is it true that archivists usually describe materials at the collection level?

• How often is DACS used as the content standard? And APPM as its predecessor?

• To what extent are the DACS minimum requirements met?

THE DATA SET

Archival records in WorldCat

OCLC’s WorldCat database of 300+ million records, filtered to extract “archival” records (currently 4 million, or about 1% of the total)

Brief version of the filter specs:

• “Unpublished” materials in any format (e.g., text, visual, moving image, sound recording)

• Coded for “archival control” (Leader byte 08)

• Held by a single institution (i.e., only one attached holding)

• Excludes published materials in any format, as well as theses and dissertations

Spoiler alert: It’s not perfect.

The full filter specs:

Same records as in ArchiveGrid

• Only one library holding symbol is attached (to eliminate non-unique items or collections)

• The MARC Leader has one or more of the following:

– Leader byte 06 (record type) has the value d (manuscript music), f (manuscript cartographic), g (projected graphics), i (nonmusic recording), j (music recording), k (visual), p (mixed), r (realia), or t (textual manuscript). [does this include all the new ones?]

– Leader byte 06 has the value "a" (language material) and Leader byte 07 (bibliographic level) has the value "c" (collection).

– Leader byte 08 has the value "a" (archival control).

• Field 260 subfields "a" and "b" are not present (to filter out published works)

• "Bibliography" does not occur at the beginning string of any MARC subject heading subfield "a" or "v" (to filter out published works).

• Field 502 is not present (to filter out theses and dissertations).

• Records with material type "book" or "serial" that have no value in fields 008 or 006 “Nature of Contents” bytes (to eliminate theses, reference works, and other non-archival materials).

http://beta.worldcat.org/archivegrid/about/
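The filter criteria on this slide can be sketched as a simple predicate. This is only an illustration, not OCLC's implementation: the `Record` class, `make_leader`, and `is_archival` are hypothetical names, each subfield is simplified to a single value, the 260 rule is read as "neither subfield a nor b is present", and the “Nature of Contents” rule is left out because the slide does not spell out how it is evaluated.

```python
from dataclasses import dataclass, field

# Leader/06 values that the slide lists as archival record types.
UNPUBLISHED_TYPES = set("dfgijkprt")


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a parsed MARC record."""
    leader: str                                 # 24-character MARC leader
    holdings: int                               # number of attached holding symbols
    fields: dict = field(default_factory=dict)  # tag -> list of {subfield code: value}


def make_leader(rec_type: str, bib_level: str, control: str) -> str:
    """Build a 24-character leader with the three bytes the filter inspects."""
    return f"00000n{rec_type}{bib_level}{control} 2200000 a 4500"


def leader_matches(leader: str) -> bool:
    """At least one of the three Leader conditions must hold."""
    rec_type, bib_level, control = leader[6], leader[7], leader[8]
    return (
        rec_type in UNPUBLISHED_TYPES
        or (rec_type == "a" and bib_level == "c")
        or control == "a"
    )


def is_archival(rec: Record) -> bool:
    """Apply the slide's criteria (minus the Nature of Contents rule)."""
    if rec.holdings != 1:
        return False
    if not leader_matches(rec.leader):
        return False
    # Assumption: any 260 $a or $b marks the record as published.
    if any("a" in f or "b" in f for f in rec.fields.get("260", [])):
        return False
    # A 502 (dissertation note) marks theses and dissertations.
    if "502" in rec.fields:
        return False
    # Subject headings (6xx) whose $a or $v starts with "Bibliography".
    for tag, occurrences in rec.fields.items():
        if tag.startswith("6"):
            for f in occurrences:
                if any(f.get(code, "").startswith("Bibliography") for code in ("a", "v")):
                    return False
    return True


papers = Record(make_leader("p", "c", "a"), 1, {"245": [{"a": "Papers"}]})
thesis = Record(make_leader("t", "m", " "), 1, {"502": [{"a": "Thesis"}]})
print(is_archival(papers), is_archival(thesis))
```

Writing the criteria as separate early returns makes it easy to count how many records each rule excludes, which is one way to test whether a given rule is too aggressive.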

So what do you think of our scoping of archival data elements?

Briefest version of the filter specs:

• “Unpublished” materials in any format

• Under “archival control”

• Held by a single institution

• Excludes all published materials

Spoiler reminder: It’s not perfect.

HIGH-LEVEL FINDINGS

A. Full data

B. Mixed materials

C. Text

D. Visual materials

E. Music scores

F. Maps

G. Audio recordings

Percent of records by type of material (pie chart):

• Book: 25%

• Mixed: 44%

• Visual: 26%

• Map: 1%

• Score: 4%

A. Full data

• “Archival control”: 28% of records

• Dates: Nearly half have date span

• Bibliographic level

– 53% describe collections

– 40% describe single items

– “Component” levels rarely used

• 95% are mixed materials, text, or visual materials

• 85% have ≥1 indexed creator names

• 75% have ≥1 indexed subject terms

• 30% have an 856 field (link to external content)

Bibliographic level by type of material

Inclusion of 6xx (subject) index terms (bar chart; scale 0% to 120%)

• Types of material shown: All, Book, Map, Mixed, Rec, Score, Visual

• Fields shown: 600, 610, 650, 651, 653, 655

A. Full data, cont.

• Cataloging level

– 29% full cataloging

– 25% minimal

– 44% unknown

• Cataloging rules

– Specified in 30% of records

– appm in 18% of records, dacs in 7%, gihc in 5%

• Form of material: Used most heavily for non-textual materials

• Language

– Two thirds in English

– Not specified in ≥ 25% of records

• Place of publication vs. location of repository

B. Mixed Materials

• 44% of all records

• 50% are under archival control

• 94% are collection records, 5% are components

• 1xx in 70% of records

• Title: 11% have no 245 $a

• Notes

– 520 in 74% of records

– 545 field in 31% of records

– 500 field in 39% of records

– No other 5xx used in ≥ 25% of records

B. Mixed Materials, cont.

• 600 in 40% of records; mean of 1.5 per record

• 650 in 52% of records; mean of 3.0 per record

• 651 in 45% of records; mean of 1.3 per record

• 655 in 63% of records; mean of 1.3 per record

• 7xx in 28% of records

• 856 in 29% of records
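Figures of the form “600 in 40% of records; mean of 1.5 per record” combine two measures per field. The sketch below shows one way to compute them. It assumes the mean is taken over all records in the group rather than only those containing the field, which is an inference from means below 1 elsewhere in these slides, not something the slides state. The `field_stats` function and the dict-based record shape are hypothetical.

```python
def field_stats(records, tag):
    """Return (share of records containing tag, mean occurrences per record).

    Each record is a dict mapping a MARC tag to a list of its occurrences.
    The mean is computed over ALL records, including those without the tag
    (an assumption about how the slide figures were derived).
    """
    counts = [len(rec.get(tag, [])) for rec in records]
    total = len(counts)
    if total == 0:
        return 0.0, 0.0
    share = sum(1 for c in counts if c > 0) / total
    mean = sum(counts) / total
    return share, mean


# Four toy records: two carry 650 fields (three and one), two carry none.
sample = [
    {"650": ["Agriculture", "Farm life", "Dairy farming"]},
    {"650": ["Railroads"]},
    {},
    {"651": ["Ohio"]},
]
share, mean = field_stats(sample, "650")
print(f"650 in {share:.0%} of records; mean of {mean:.1f} per record")
```

Under this reading a field can appear in fewer than half of the records and still show a mean above 1, because records that use the field often repeat it.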

C. Text

• 25% of all records

– 4% are book and pamphlet collections

– 21% are textual manuscripts

• 25% of textual manuscript records are under archival control

• 30% are collection records, 70% are items

• 1xx in 77% of records

• Title: 11% have no 245 $a

• Notes

– 43% have 520 field

– 54% have 500 field

C. Text, cont.

• 600 in 31% of records; mean of 0.9 per record

• 650 in 42% of records; mean of 1.7 per record

• 651 in 31% of records; mean of 0.8 per record

• 655 in 36% of records; mean of 0.7 per record

• 7xx in 50% of records

D. Visual Materials

• 26% of all records

• ≤ 10% are under archival control

• 57% have 007 (technical data values)

• 15% are collection records, 76% are items

• 1xx in 51% of records

• Notes

– 500 in 77% of records

– 520 in 68% of records

– 540 in 57% of records

D. Visual Materials, cont.

• 600 in 32% of records; mean of 1.1 per record

• 650 in 68% of records; mean of 4.2 per record

• 651 in 38% of records; mean of 1.5 per record

• 655 in 81% of records; mean of 1.5 per record

• 7xx in 31% of records

• 856 in 48% of records

E. Music Scores

• 4% of all records

• 1xx in 90% of records

• 240 in 41% of records

• 500 in 96% of records; negligible use of other 5xx’s

• 650 in 96% of records; mean of 2.4 per record

• 655 in 34% of records; genre/form terms often in 650 instead

• 856 in 25% of records

F. Maps

• Less than 1% of all records

• 65% have 007 (technical data values)

• Field 043 (hierarchical geographic area code) in 80% of records

• 052 in 66% of records (geographic classification)

• 1xx in 53% of records

• 255 in 92% of records (cartographic mathematical data)

F. Maps, cont.

• 500 in 93% of records; use of other 5xx’s negligible

• 650 in 68% of records; mean of 2.8 per record

• 651 in 83% of records; mean of 2.7 per record

• 655 in 84% of records; mean of 1.8 per record

• 7xx in 50% of records

G. Audio Recordings

• Less than 1% of all records

• 60% have 007 (technical data values)

• 1xx in 83% of records

• Notes

– 500 in 77%

– 520 in 68%

– 530 in 27%

– 540 in 57%

G. Audio Recordings, cont.

• 650 in 68%; mean of 5.2 per record

• 651 in 47%; mean of 0.9 per record

• 655 in 67% of records; mean of 1.2 per record

• 7xx in 100% of records

• 856 in 22% of records

NEXT STEPS

Draw conclusions (a few for starters)

• Mixed and textual materials cataloged as collections; other formats not so much

• “Archival control” byte is far from universally used, so has little value

• Few of the note fields added for archival or visual materials communities are widely used (does it matter?)

• As many as 25% of titles for mixed and textual collections make for lousy browsing (e.g., “Papers” or “Records”)

• Ponder implications for next-gen cataloging (linked data, BIBFRAME, schema.org)
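The point about generic titles suggests a simple check that could be run over 245 $a values. The sketch below is only an illustration: `GENERIC_TITLES` holds the two examples named on the slide, and both that list and the `is_generic_title` and `generic_share` helpers are hypothetical, not the method used to produce the 25% estimate.

```python
import re

# The two examples given on the slide; a real list would need to be longer.
GENERIC_TITLES = {"papers", "records"}


def is_generic_title(title_a: str) -> bool:
    """True if a 245 $a consists solely of a generic word such as "Papers"."""
    # Drop ISBD punctuation (trailing commas, periods) before comparing.
    normalized = re.sub(r"[^\w\s]", "", title_a).strip().lower()
    return normalized in GENERIC_TITLES


def generic_share(titles) -> float:
    """Fraction of titles that are generic; 0.0 for an empty list."""
    titles = list(titles)
    if not titles:
        return 0.0
    return sum(is_generic_title(t) for t in titles) / len(titles)


titles = ["Papers,", "Records.", "John Smith papers,", "Letters to Mary Jones"]
print(generic_share(titles))
```

A check like this only catches titles where the creator name is absent from $a, which matches the browsing problem described: a list of results that all read “Papers”.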

Please send feedback

• Do the data debunk any assumptions?

• Are you dubious about any of the data?

• Would you tweak the specs of our filter?

• Are changes in practice called for?

• What other questions should I be asking?

• Is this a useful project or just an “interesting” one?

Publications & future research

• Publish this data

• Second paper: Implications for discovery

• Future research?

– Data content

– Potential for data remediation

• Generic titles (e.g., Papers, Records)

• Missing language codes

• Other?

– Descriptive practice for web archiving

• If you need an OCLC data set for research ...


Thanks!

Jackie Dooley

Program Officer, OCLC Research

[email protected]

@minniedw

SAA Research Forum