NLM Digital Collections Update for DCFedoraUsersGroup January 22, 2013 John Doyle National Library of Medicine



NLM Digital Collections
Update for DCFedoraUsersGroup

January 22, 2013

John Doyle
National Library of Medicine


The Story So Far

Texts
  – 7,866 books, incl. 225 multi-vol sets
  – Medical Heritage Library
      1.7m pages
      In-house digitization
  – 1 multi-part report
Audiovisuals
  – 70 films
  – 2 thematic collections


The Saga Continues

Serials
  – NIH Institute annual reports
  – 61-volume printed index of historical citations
  – Journals may be coming soon
Oral Histories
Still Images
Born-digital resources
Citation dataset


Public Interface: “Digital Collections”

Browse & Search (Muradora)
  – Supports multiple collections, diverse content
  – Resource display page: metadata, datastreams
Book Viewer (NWU)
  – Open source software from Northwestern University
  – Open source JPEG2000 server (Djatoka)
Video Player with Search (NLM)
  – Features video transcript search and play-ahead jump
  – HHS Innovates finalist (top 6), Fall 2011


Replacing Muradora

Muradora codebase is aging
  – No community development or support
Newer community projects reaching maturity
  – Islandora
  – Hydra
Priority is to preserve/enhance resource search and browse
Probably retain the book and video viewing applications


Current Developments

Workflows
  – Increasingly concurrent content projects
  – Moving from project-specific to project-agnostic
Data Services
  – Programmatic access: search web service
  – Bulk data
  – Need to pin down use cases
Fedora framework upgrading
  – Journaling for propagating changes across multiple Fedora instances


Current Developments

Periodic checksum checking
  – Make use of recent Fedora enhancements in this area
Third copy of content
  – “Just in case” copy, not primary disaster recovery
  – Amazon Glacier seems to be a good fit
Descriptive Metadata
  – More automated updating of ILS
  – Need to update Fedora/Solr post-ingest
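As an illustration of the periodic checksum checking mentioned on this slide, here is a minimal file-level sketch in Python. It does not use Fedora's own checksum features, which the slides refer to but do not describe. The manifest layout (relative path mapped to a recorded SHA-256 digest) and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected: dict[str, str], root: Path) -> list[str]:
    """Compare recorded digests against freshly computed ones.

    `expected` maps a relative file path to its recorded hex digest
    (a hypothetical manifest layout). Missing files count as mismatches.
    """
    mismatches = []
    for relative_path, recorded in sorted(expected.items()):
        target = root / relative_path
        if not target.is_file() or file_checksum(target) != recorded.lower():
            mismatches.append(relative_path)
    return mismatches
```

A scheduled job could call `find_mismatches` over each storage copy and report any paths it returns for manual follow-up.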


Related Activities

Internet Archive
  – Over 6,500 books uploaded as part of MHL project
  – Only selected datastreams going up
  – Expect to continue sending books to IA going forward
HathiTrust
  – Working group delivered recommendations last year
  – Participation could involve an IA-to-HT path
  – Some bibliographic challenges to be met


NLM Digital Collections
Support for Multi-volume texts

January 22, 2013

Nancy Fallgren, Doron Shalvi
National Library of Medicine


Outline

Regular book processing
Regular book data model and presentation
What is a multi-volume?
Multi-volume metadata issues
Multi-volume scanning and identifiers
Multi-volume metadata generation and workflow
Asynchronous volume processing (a.k.a. Jail)
Multi-volume data model and presentation
Software adjustments
Questions


Regular book processing

Voyager record
  – One-to-one relationship between BIB record and digital object
Metadata processing
  – MARCXML to OAI-DC and DMDINDEX
Preingest process
  – Create derivatives
  – Generate FOXML
  – Locate files
Ingest into Fedora


Regular book data model

ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book


Regular book presentation



What is a Multi-volume?

Multiple volume monographic series
  – All volumes share the same series title
  – Each volume may or may not have a unique title
  – The series has a finite beginning and end
Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit; individual volumes do not have their own catalog/BIB records
Not journals or serials


Multi-volume metadata issues

One-to-many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
  – NLM UID (MARC 035$9) is the basis for each digital object’s PID
  – Disambiguating volume titles
Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows


Scanning Spreadsheets: UIDs and volume nos.


From spreadsheet to XML


Set/Parent MARCXML


New child/volume MARCXML


Set/Parent DC


Child/Volume DC


Disambiguating Multi-volume workflows

Transform pre-ingest manifests (UID lists)
  – Remove all UIDs with “X#” suffix
Transform post-ingest manifests
  – Remove all “X#” suffixes from UIDs
  – De-dupe the remaining list
  – Add only set/parent URL to BIB records
DREPSERIES code
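The two manifest transforms listed above can be sketched in a few lines of Python. This is only an illustration: the slides say "X#" without spelling out the exact UID pattern, so the regular expression (a trailing "X" followed by digits), the sample UIDs, and the function names are all assumptions.

```python
import re

# Assumption: a volume UID is the set UID plus "X" and a volume number,
# e.g. "2000X2". The slides only say "X#"; the exact pattern is a guess.
VOLUME_SUFFIX = re.compile(r"X\d+$")


def pre_ingest_manifest(uids: list[str]) -> list[str]:
    """Drop every UID that carries a volume suffix."""
    return [uid for uid in uids if not VOLUME_SUFFIX.search(uid)]


def post_ingest_manifest(uids: list[str]) -> list[str]:
    """Strip volume suffixes, then de-duplicate while keeping first-seen order."""
    stripped = (VOLUME_SUFFIX.sub("", uid) for uid in uids)
    return list(dict.fromkeys(stripped))


# Hypothetical manifest: two monographs and two volumes of one set.
manifest = ["1000", "2000X1", "2000X2", "3000"]
monographs_only = pre_ingest_manifest(manifest)   # ["1000", "3000"]
one_entry_per_bib = post_ingest_manifest(manifest)  # ["1000", "2000", "3000"]
```

After the post-ingest transform each BIB record appears once, which matches the goal of adding only the set/parent URL to it.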


Asynchronous Volume processing, a.k.a. Jail

Do not pass GO, do not collect $200
Volumes are scanned and processed asynchronously
Set object created for first child part
Standard processing and review workflow
Volumes held in Jail (no further processing) until all volumes pass manual review on Fedora QA system
Once all volumes reviewed, full set promoted to Production
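The "Jail" rule above amounts to a simple gate: a set is promoted only once every one of its volumes has passed review. The sketch below shows that gate in Python. The class, its fields, and the assumption that the expected volume count is known up front are hypothetical; the slides do not describe how the actual workflow tracks review status.

```python
from dataclasses import dataclass, field


@dataclass
class VolumeSet:
    """Tracks which volumes of one multi-volume set have passed QA review."""

    set_uid: str
    expected_volumes: int  # assumption: the total is known up front
    reviewed: set[int] = field(default_factory=set)

    def mark_reviewed(self, volume_number: int) -> None:
        """Record that a volume passed manual review on the QA system."""
        if not 1 <= volume_number <= self.expected_volumes:
            raise ValueError(
                f"volume {volume_number} is outside 1..{self.expected_volumes}"
            )
        self.reviewed.add(volume_number)

    def in_jail(self) -> bool:
        """True while at least one expected volume is still unreviewed."""
        return len(self.reviewed) < self.expected_volumes


def promotable(sets: list[VolumeSet]) -> list[str]:
    """Return the UIDs of sets whose volumes have all been reviewed."""
    return [s.set_uid for s in sets if not s.in_jail()]
```

Because reviewed volumes are stored in a set, reviewing the same volume twice does not count it twice, so volumes can arrive and be reviewed in any order.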


Multi-volume set data model

ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in set
Preview   E     image/jpeg           JPG Preview image of selected page in set

Same data model as book, but no METS, OCR or PDF


Multi-volume part data model

ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book

Same data model as book


Multi-volume relationships

Set fedora:hasPart Part
Part fedora:isPartOf Set
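To make the set/part relationships concrete, the following Python sketch builds a RELS-EXT RDF/XML document using the Fedora relations-external namespace, which defines `hasPart` and `isPartOf`. The function name and the `demo:` PIDs are hypothetical, and this is not NLM's actual FOXML generation process.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"

# Use readable prefixes in the serialized output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("fedora", REL_NS)


def rels_ext(subject_pid: str, predicate: str, object_pids: list[str]) -> str:
    """Build a RELS-EXT RDF/XML document with one statement per object PID."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{subject_pid}"},
    )
    for pid in object_pids:
        ET.SubElement(
            description,
            f"{{{REL_NS}}}{predicate}",
            {f"{{{RDF_NS}}}resource": f"info:fedora/{pid}"},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical PIDs: one set object and two volume (part) objects.
set_rdf = rels_ext("demo:2000", "hasPart", ["demo:2000X1", "demo:2000X2"])
part_rdf = rels_ext("demo:2000X1", "isPartOf", ["demo:2000"])
```

The set object carries one `hasPart` statement per volume, while each part carries a single `isPartOf` statement pointing back to the set.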


Multi-volume presentation - set



Multi-volume presentation - part



Software adjustments

Creation of new content models: mvset, mvpart
New process to generate FOXML, capture thumb
New relationships in RELS-EXT
Adjustment of UI and business logic to handle sets: link to all parts, query part names from Solr
Adjustment of UI to handle child parts: link back to set
Hide basic display of dc.relation; info in hotlinks instead
More abstract content models, to reduce redundant changes, would have helped


Demonstration

http://collections.nlm.nih.gov