NLM Digital Collections
Update for DC Fedora Users Group
January 22, 2013
John Doyle
National Library of Medicine
The Story So Far
Texts
– 7,866 books, incl. 225 multi-vol sets
– Medical Heritage Library
  1.7m pages
  In-house digitization
– 1 multi-part report
Audiovisuals
– 70 films
– 2 thematic collections
The Saga Continues
Serials
– NIH Institute annual reports
– 61 volume printed index of historical citations
– Journals may be coming soon
Oral Histories
Still Images
Born-digital resources
Citation dataset
Public Interface: “Digital Collections”
Browse & Search (Muradora)
– Supports multiple collections, diverse content
– Resource display page: metadata, datastreams
Book Viewer (NWU)
– Open source software from Northwestern University
– Open source JPEG2000 server (Djatoka)
Video Player with Search (NLM)
– Features video transcript search and play-ahead jump
– HHS Innovates finalist (top 6), Fall 2011
Replacing Muradora
Muradora codebase is aging
– No community development or support
Newer community projects reaching maturity
– Islandora
– Hydra
Priority is to preserve/enhance resource search and browse
Probably retain the book and video viewing applications
Current Developments
Workflows
– Increasingly concurrent content projects
– Moving from project-specific to project-agnostic
Data Services
– Programmatic access: search web service
– Bulk data
– Need to pin down use cases
Fedora framework upgrading
– Journaling for propagating changes across multiple Fedora instances
Current Developments
Periodic checksum checking
– Make use of recent Fedora enhancements in this area
Third copy of content
– “Just in case” copy, not primary disaster recovery
– Amazon Glacier seems to be a good fit
Descriptive Metadata
– More automated updating of ILS
– Need to update Fedora/Solr post-ingest
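As a rough illustration of periodic checksum checking, the sketch below recomputes a digest for each stored file and reports any that no longer match the recorded value. It is a generic Python example, not the Fedora mechanism the slide refers to; the function names, the SHA-256 default, and the path-to-checksum mapping are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected):
    """Return paths that are missing or whose current checksum differs.

    `expected` is a hypothetical mapping of file path -> recorded checksum.
    """
    bad = []
    for path, recorded in expected.items():
        if not Path(path).is_file() or file_checksum(path) != recorded:
            bad.append(path)
    return bad
```

A scheduled job could call `find_mismatches` on the recorded checksums and flag any returned paths for repair from another copy.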
Related Activities
Internet Archive
– Over 6,500 books uploaded as part of MHL project
– Only selected datastreams going up
– Expect to continue sending books to IA going forward
Hathi Trust
– Working group delivered recommendations last year
– Participation could involve an IA-to-HT path
– Some bibliographic challenges to be met
NLM Digital Collections
Support for Multi-volume texts
January 22, 2013
Nancy Fallgren, Doron Shalvi
National Library of Medicine
Outline
Regular book processing
Regular book data model and presentation
What is a multi-volume?
Multi-volume metadata issues
Multi-volume scanning and identifiers
Multi-volume metadata generation and workflow
Asynchronous volume processing (a.k.a. Jail)
Multi-volume data model and presentation
Software adjustments
Questions
Regular book processing
Voyager record
– One to one relationship between BIB record and digital object
Metadata processing
– MARCXML to OAI-DC and DMDINDEX
Preingest process
– Create derivatives
– Generate FOXML
– Locate files
Ingest into Fedora
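To make the "MARCXML to OAI-DC" step more concrete, here is a minimal Python sketch that copies two MARC subfields into an `oai_dc:dc` element. It is not NLM's transform, which the slides do not show. The choice of 245$a for the title and 035$9 for the identifier (the NLM UID subfield named later in this deck) is an illustrative assumption, and `subfields` and `marc_to_dc` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"
DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)


def subfields(record, tag, code):
    """Collect the text of every subfield `code` in datafields with `tag`."""
    values = []
    for field in record.iter(f"{{{MARC}}}datafield"):
        if field.get("tag") != tag:
            continue
        for sub in field.findall(f"{{{MARC}}}subfield"):
            if sub.get("code") == code and sub.text:
                values.append(sub.text.strip())
    return values


def marc_to_dc(marcxml):
    """Build a small oai_dc:dc document from one MARCXML record string."""
    record = ET.fromstring(marcxml)
    dc = ET.Element(f"{{{OAI_DC}}}dc")
    for title in subfields(record, "245", "a"):  # assumed mapping: title
        ET.SubElement(dc, f"{{{DC}}}title").text = title
    for uid in subfields(record, "035", "9"):  # assumed mapping: identifier
        ET.SubElement(dc, f"{{{DC}}}identifier").text = uid
    return ET.tostring(dc, encoding="unicode")
```

A production crosswalk would cover many more fields and would more likely be an XSLT stylesheet; the sketch only shows the shape of the mapping.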
Regular book data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
Regular book presentation
What is a Multi-volume?
Multiple volume monographic series
– All volumes share the same series title
– Each volume may or may not have a unique title
– The series has a finite beginning and end
Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit; individual volumes do not have their own catalog/BIB records
Not journals or serials
Multi-volume metadata issues
One to many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
– NLM UID (MARC 035$9) is the basis for each digital object’s PID
– Disambiguating volume titles
Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows
Scanning Spreadsheets: UIDs and volume nos.
From spreadsheet to XML
Set/Parent MARCXML
New child/volume MARCXML
Set/Parent DC
Child/Volume DC
Disambiguating Multi-volume workflows
Transform pre-ingest manifests (UID lists)
– Remove all UIDs with “X#” suffix
Transform post-ingest manifests
– Remove all “X#” suffixes from UIDs
– De-dupe the remaining list
– Add only set/parent URL to BIB records
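The two manifest transforms above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual NLM scripts: it assumes a volume UID is the set UID followed by "X" and a volume number (one reading of “X#”), and the function names and sample UIDs are hypothetical.

```python
import re

# Assumption: "X#" means a trailing "X" plus one or more digits, e.g. "200X2".
VOLUME_SUFFIX = re.compile(r"X\d+$")


def pre_ingest_manifest(uids):
    """Drop every UID that carries an X# volume suffix."""
    return [uid for uid in uids if not VOLUME_SUFFIX.search(uid)]


def post_ingest_manifest(uids):
    """Strip X# suffixes, then de-duplicate while preserving order."""
    stripped = (VOLUME_SUFFIX.sub("", uid) for uid in uids)
    return list(dict.fromkeys(stripped))
```

For a list such as `["100", "200X1", "200X2", "300"]`, the pre-ingest transform keeps only the unsuffixed UIDs, while the post-ingest transform collapses both volumes of "200" into a single set/parent entry.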
DREPSERIES code
Asynchronous Volume processing
a.k.a. Jail
Do not pass GO, do not collect $200
Volumes are scanned and processed asynchronously
Set object created for first child part
Standard processing and review workflow
Volumes held in Jail – no further processing – until all volumes pass manual review on Fedora QA system
Once all volumes reviewed, full set promoted to Production
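The Jail rule amounts to a simple gate: nothing in a set moves forward until every volume has arrived and passed review. The Python sketch below models that gate only; the class names, the `reviewed` flag, and the `expected_volumes` count (imagined here as coming from the scanning spreadsheet) are hypothetical and not part of the NLM software.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    uid: str
    reviewed: bool = False  # set once manual QA review has passed


@dataclass
class MultiVolumeSet:
    uid: str
    expected_volumes: int  # assumption: known in advance for the set
    volumes: dict = field(default_factory=dict)

    def add_volume(self, volume):
        """Register a volume as it arrives; arrival order does not matter."""
        self.volumes[volume.uid] = volume

    def ready_for_production(self):
        """True only when every expected volume is present and reviewed."""
        return (
            len(self.volumes) == self.expected_volumes
            and all(v.reviewed for v in self.volumes.values())
        )
```

A promotion job would check `ready_for_production()` for each set and leave the whole set in Jail while it returns `False`.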
Multi-volume set data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in set
Preview   E     image/jpeg           JPG Preview image of selected page in set

Same data model as book, but no METS, OCR or PDF
Multi-volume part data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book

Same data model as book
Multi-volume relationships
Set fedora:hasPart Part
Part fedora:isPartOf Set
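As an illustration of how these relationships might be serialized, the Python sketch below builds a RELS-EXT style RDF/XML fragment with `hasPart` or `isPartOf` statements. It assumes the Fedora 3 relations-external namespace and `info:fedora/` object URIs; the helper name `rels_ext` and the PIDs in the usage note are hypothetical, and this is not the FOXML generation process used at NLM.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: Fedora 3 relationship ontology namespace.
REL_NS = "info:fedora/fedora-system:def/relations-external#"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("fedora", REL_NS)


def rels_ext(pid, predicate, target_pids):
    """Return RDF/XML stating `pid predicate target` for each target PID."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{pid}"},
    )
    for target in target_pids:
        ET.SubElement(
            description,
            f"{{{REL_NS}}}{predicate}",
            {f"{{{RDF_NS}}}resource": f"info:fedora/{target}"},
        )
    return ET.tostring(root, encoding="unicode")
```

For example, `rels_ext("demo:set", "hasPart", ["demo:vol1", "demo:vol2"])` describes the set side, and `rels_ext("demo:vol1", "isPartOf", ["demo:set"])` describes one part.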
Multi-volume presentation - set
Multi-volume presentation - part
Software adjustments
Creation of new content models – mvset, mvpart
New process to generate FOXML, capture thumb
New relationships in RELS-EXT
Adjustment of UI and business logic to handle sets – link to all parts, query part names from Solr
Adjustment of UI to handle child parts – link back to set
Hide basic display of dc.relation – info in hotlinks instead
More abstract content models, to reduce redundant changes, would have helped