NLM Digital Collections Update for DCFedoraUsersGroup January 22, 2013 John Doyle National Library of Medicine


The Story So Far

Texts
– 7,866 books, incl. 225 multi-vol sets
– Medical Heritage Library
    1.7m pages
    In-house digitization
– 1 multi-part report

Audiovisuals
– 70 films
– 2 thematic collections

The Saga Continues

Serials
– NIH Institute annual reports
– 61-volume printed index of historical citations
– Journals may be coming soon

Oral Histories
Still Images
Born-digital resources
Citation dataset

Public Interface: “Digital Collections”

Browse & Search (Muradora)
Supports multiple collections, diverse content
Resource display page: metadata, datastreams

Book Viewer (NWU)
Open source software from Northwestern University
Open source JPEG2000 server (Djatoka)

Video Player with Search (NLM)
Features video transcript search and play-ahead jump
HHS Innovates finalist (top 6), Fall 2011

Replacing Muradora

Muradora codebase is aging
– No community development or support

Newer community projects reaching maturity
– Islandora
– Hydra

Priority is to preserve/enhance resource search and browse

Probably retain the book and video viewing applications

Current Developments

Workflows
– Increasingly concurrent content projects
– Moving from project-specific to project-agnostic

Data Services
– Programmatic access – search web service
– Bulk data
– Need to pin down use cases

Fedora framework upgrading
– Journaling for propagating changes across multiple Fedora instances

Current Developments

Periodic checksum checking
– Make use of recent Fedora enhancements in this area

Third copy of content
– “Just in case” copy, not primary disaster recovery
– Amazon Glacier seems to be a good fit

Descriptive Metadata
– More automated updating of ILS
– Need to update Fedora/Solr post-ingest
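As a rough illustration of what periodic checksum checking involves, the sketch below recomputes file digests and compares them with previously recorded values. It is a generic example, not the Fedora feature the slide refers to; the function names, the choice of SHA-256, and the path-to-digest dictionary are all assumptions made for illustration.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(recorded):
    """Given {path: recorded hex digest}, return the paths whose content changed.

    Hypothetical helper: a real repository would read the recorded digests
    from its own datastream metadata rather than from a plain dictionary.
    """
    return [path for path, expected in recorded.items()
            if file_checksum(path) != expected]
```

Run on a schedule, a check like this flags any file whose bytes no longer match the digest recorded at ingest.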


Related Activities

Internet Archive
– Over 6,500 books uploaded as part of MHL project
– Only selected datastreams going up
– Expect to continue sending books to IA going forward

Hathi Trust
– Working group delivered recommendations last year
– Participation could involve an IA-to-HT path
– Some bibliographic challenges to be met

NLM Digital Collections Support for Multi-volume texts

January 22, 2013

Nancy Fallgren, Doron Shalvi, National Library of Medicine

Outline

Regular book processing
Regular book data model and presentation
What is a multi-volume?
Multi-volume metadata issues
Multi-volume scanning and identifiers
Multi-volume metadata generation and workflow
Asynchronous volume processing (a.k.a. Jail)
Multi-volume data model and presentation
Software adjustments
Questions

Regular book processing

Voyager record
– One to one relationship between BIB record and digital object

Metadata processing
– MARCXML to OAI-DC and DMDINDEX

Preingest process
– Create derivatives
– Generate FOXML
– Locate files

Ingest into Fedora

Regular book data model

ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book

Regular book presentation


What is a Multi-volume?

Multiple volume monographic series
– All volumes share the same series title
– Each volume may or may not have a unique title
– The series has a finite beginning and end

Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit; individual volumes do not have their own catalog/BIB records

Not journals or serials

Multi-volume metadata issues

One to many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
– NLM UID (MARC 035$9) is the basis for each digital object’s PID
– Disambiguating volume titles

Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows
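To make the identifier issue concrete, here is a minimal sketch that pulls the NLM UID out of MARC 035$9 in a MARCXML record and derives one identifier per volume. The MARCXML namespace is the standard Library of Congress one; the function names are hypothetical, and the assumption that volume identifiers are formed by appending "X" plus a volume number is inferred from the "X#" suffix mentioned in the workflow slides, not confirmed by them.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def nlm_uid(marcxml):
    """Return the text of the first 035$9 subfield, or None if absent."""
    root = ET.fromstring(marcxml)
    node = root.find(
        ".//marc:datafield[@tag='035']/marc:subfield[@code='9']", MARC_NS
    )
    if node is None or not node.text:
        return None
    return node.text.strip()


def volume_uids(set_uid, volume_count):
    """Assumed scheme: set UID + 'X' + volume number, one per volume."""
    return [f"{set_uid}X{number}" for number in range(1, volume_count + 1)]
```

Because the set and all of its volumes share one BIB record, deriving every volume identifier from the single 035$9 value keeps the one-to-many link traceable.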

Scanning Spreadsheets: UIDs and volume nos.

From spreadsheet to XML

Set/Parent MARCXML

New child/volume MARCXML

Set/Parent DC

Child/Volume DC

Disambiguating Multi-volume workflows

Transform pre-ingest manifests (UID lists)
– Remove all UIDs with “X#” suffix

Transform post-ingest manifests
– Remove all “X#” suffixes from UIDs
– De-dupe the remaining list
– Add only set/parent url to BIB records
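The two manifest transformations above can be sketched in a few lines. This is an illustration only: it assumes a manifest is a list of UID strings and that the “X#” suffix means a literal "X" followed by one or more digits at the end of the UID. The function names are hypothetical.

```python
import re

# Assumption: "X#" is a literal "X" followed by a volume number.
VOLUME_SUFFIX = re.compile(r"X\d+$")


def pre_ingest_manifest(uids):
    """Drop every UID that carries a volume suffix."""
    return [uid for uid in uids if not VOLUME_SUFFIX.search(uid)]


def post_ingest_manifest(uids):
    """Strip volume suffixes, then de-duplicate, keeping first-seen order."""
    stripped = (VOLUME_SUFFIX.sub("", uid) for uid in uids)
    return list(dict.fromkeys(stripped))
```

For a manifest such as `["101234567", "101111111X1", "101111111X2"]`, the pre-ingest transform keeps only the unsuffixed UID, while the post-ingest transform collapses the two volume entries into the single set/parent UID, so only one URL is added per BIB record.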

DREPSERIES code

Asynchronous Volume processing a.k.a. Jail

Do not pass GO, do not collect $200
Volumes are scanned and processed asynchronously
Set object created for first child part
Standard processing and review workflow
Volumes held in Jail – no further processing – until all volumes pass manual review on Fedora QA system
Once all volumes reviewed, full set promoted to Production
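The Jail rule amounts to a gate: a set is promoted only once every one of its volumes has passed review. A minimal sketch of that gate follows. The class and method names are hypothetical, and it assumes the full list of expected volume identifiers is known in advance (for example, from the scanning spreadsheet).

```python
from dataclasses import dataclass, field


@dataclass
class JailedSet:
    """Tracks review status for one multi-volume set held in Jail."""
    set_uid: str
    expected_volumes: frozenset
    reviewed: set = field(default_factory=set)

    def mark_reviewed(self, volume_uid):
        """Record that a volume passed manual review on the QA system."""
        if volume_uid not in self.expected_volumes:
            raise ValueError(f"{volume_uid} is not part of set {self.set_uid}")
        self.reviewed.add(volume_uid)

    def ready_for_production(self):
        """True only when every expected volume has been reviewed."""
        return self.reviewed >= self.expected_volumes
```

Volumes can arrive and be reviewed in any order; the set stays in Jail until `ready_for_production()` returns true.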

Multi-volume set data model

ID       TYPE  MIMETYPE    LABEL
THUMB    E     image/jpeg  JPG Thumbnail image of selected page in set
Preview  E     image/jpeg  JPG Preview image of selected page in set

Same data model as book, but no METS, OCR or PDF

Multi-volume part data model

ID       TYPE  MIMETYPE         LABEL
METS     M     text/xml         METS file for entire book
OCR      E     text/plain       Book OCR - full text of entire book
PDF      E     application/pdf  PDF of entire book
THUMB    E     image/jpeg       JPG Thumbnail image of selected page in book
Preview  E     image/jpeg       JPG Preview image of selected page in book

Same data model as book

Multi-volume relationships

Set fedora:hasPart Part

Part fedora:isPartOf Set
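As an illustration of how these two relationships might be written into a RELS-EXT datastream, the sketch below builds the RDF/XML with the Python standard library. The namespace URIs are the standard RDF and Fedora relations-external ones; the function name and the example PIDs are hypothetical, and the real FOXML generation process may differ.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(subject_pid, predicate, object_pids):
    """Build RELS-EXT RDF/XML linking one object to others.

    predicate is 'hasPart' (on the set) or 'isPartOf' (on a part).
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("fedora", REL)
    root = ET.Element(f"{{{RDF}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF}}}Description",
        {f"{{{RDF}}}about": f"info:fedora/{subject_pid}"},
    )
    for pid in object_pids:
        ET.SubElement(
            description,
            f"{{{REL}}}{predicate}",
            {f"{{{RDF}}}resource": f"info:fedora/{pid}"},
        )
    return ET.tostring(root, encoding="unicode")
```

A set object would call `rels_ext(set_pid, "hasPart", part_pids)`, and each part would call `rels_ext(part_pid, "isPartOf", [set_pid])`, so the link can be followed in either direction.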

Multi-volume presentation - set


Multi-volume presentation - part


Software adjustments

Creation of new content models – mvset, mvpart
New process to generate FOXML, capture thumb
New relationships in RELS-EXT
Adjustment of UI and business logic to handle sets – link to all parts, query part names from Solr
Adjustment of UI to handle child parts – link back to set
Hide basic display of dc.relation – info in hotlinks instead
More abstract content models, to reduce redundant changes, would have helped

Demonstration

http://collections.nlm.nih.gov
