
ITHAKA Preservation Metadata 2.0: Revising the Event Model

A last-minute presentation on work currently in progress

Evan Owens, VP, Content Management, ITHAKA (JSTOR / Portico)

Background

• Portico Preservation Metadata designed & implemented in 2002-2003

– Inspired by PREMIS working group participation

– Operational before PREMIS was completed!

• Portico Archive as of October 2009

– >14 Million E-Journal Articles plus other content

– ~150 Million Files

– ~1 Billion Events

– Only 1K manual events; 99.999% system-generated

– Over 1 TB of Preservation Metadata

• Portico / JSTOR / Ithaka merger in 2009

2.0 PMD Revision Project

• Begun in 2008; implementation now underway

• Design Goals for Revision to Events:

– Consistent editorial/coding practices (capitalization, verb tenses, etc.)

– Clarify what event goes with which object and why

– Eliminate redundant information where possible

– Make explicit all data constraints not currently expressed in our schemas

– Synchronize event metadata with the high-level preservation metadata so that the events properly document changes in the core metadata

– Establish a clean base line for future expansion of events metadata

PMD 2.0 Design Choices

• Use our own data model / information architecture

– Optimized for Java, Oracle, and XML instantiations

– XML designed to reduce future versioning:

• XSD schema for frame (syntax) only

• All business rules (semantics) expressed in Schematron

– Not METS, not DIDL, not PREMIS XML

– PREMIS compliant

• Optimized for size and speed

– Fully relationally normalized

– Inheritable attributes / metadata

– Events attached to objects
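A minimal sketch of what "inheritable attributes" and "events attached to objects" could look like in code. The class, field names, and sample values are hypothetical illustrations, not the PMD 2.0 schema; the point is only that a value stored once on a parent does not need to be repeated on every child.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArchiveObject:
    """Hypothetical archive object; names are illustrative, not the PMD 2.0 schema."""
    name: str
    parent: Optional["ArchiveObject"] = None
    attributes: dict = field(default_factory=dict)  # values set locally on this object
    events: list = field(default_factory=list)      # events attached to this object

    def resolve(self, key: str):
        """Return the local value if present, otherwise inherit from the nearest ancestor."""
        node = self
        while node is not None:
            if key in node.attributes:
                return node.attributes[key]
            node = node.parent
        return None


# The attribute is stored once on the article and inherited by its file.
article = ArchiveObject("article-1", attributes={"preservation_level": "full"})
pdf = ArchiveObject("article-1.pdf", parent=article)
pdf.events.append("Check: Fixity")  # the event lives on the object it describes

print(pdf.resolve("preservation_level"))  # inherited from the article
```

Storing a value once and resolving it upward is one way to keep metadata small at this scale; whether PMD 2.0 resolves inheritance this way is an assumption.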

Processing Record

“Master” for each processing pass

Brings together information common to all the events from a given processing pass; e.g., initial ingest, future migration, etc.

Not a real event!

Example XML serialization showing all possible child elements to illustrate the information model
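As a rough stand-in for the idea of a processing record, the sketch below builds one record and two events that point back to it. Every element and attribute name here (`processingRecord`, `purpose`, `agent`, `event`) is a hypothetical placeholder, not the actual PMD 2.0 vocabulary, and the record shows only two invented child elements rather than the full set.

```python
import xml.etree.ElementTree as ET

# All element and attribute names below are hypothetical placeholders,
# not the actual PMD 2.0 vocabulary.
record = ET.Element("processingRecord", {"id": "pr-001"})
ET.SubElement(record, "purpose").text = "initial ingest"
ET.SubElement(record, "agent").text = "ingest-workflow"

# Events carry only what is specific to them and point back to the record,
# so the shared information is written once per processing pass.
events = [
    ET.Element("event", {"type": "Check: Virus", "processingRecord": "pr-001"}),
    ET.Element("event", {"type": "Generate: Fixity", "processingRecord": "pr-001"}),
]


def shared_info(event, records):
    """Look up the information common to the event's processing pass."""
    rec = records[event.get("processingRecord")]
    return {child.tag: child.text for child in rec}


records = {record.get("id"): record}
print(shared_info(events[0], records))
print(ET.tostring(record, encoding="unicode"))
```

The record itself is not an event; it only holds what would otherwise be duplicated on each event from the same pass.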

Event Types

• Check: Virus, Fixity, …

• Characterize: File, …

• Generate: Desc. MD, Tech. MD, Fixity, …

• Edit: Desc. MD, …

• Set: Status, Format, Preservation Level, …

• Ingest: into Archive

• Add, Create, Remove File
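The verb-plus-target pattern in the list above can be sketched as a small lookup table. This is an illustration only: the table holds just the pairs named on the slide (the "…" entries are left open), and the `event_label` helper is a hypothetical name, not part of PMD 2.0.

```python
# Verb -> targets named on the slide; the slide's "…" entries are left open,
# so this table is deliberately incomplete.
EVENT_TYPES = {
    "Check": ["Virus", "Fixity"],
    "Characterize": ["File"],
    "Generate": ["Desc. MD", "Tech. MD", "Fixity"],
    "Edit": ["Desc. MD"],
    "Set": ["Status", "Format", "Preservation Level"],
    "Ingest": ["into Archive"],
    "Add": ["File"],
    "Create": ["File"],
    "Remove": ["File"],
}


def event_label(verb: str, target: str) -> str:
    """Build a 'Verb: Target' label, rejecting pairs that are not in the table."""
    if target not in EVENT_TYPES.get(verb, []):
        raise ValueError(f"unlisted event type: {verb} / {target}")
    return f"{verb}: {target}"


print(event_label("Set", "Status"))
```

Generating labels from one table, rather than typing them by hand, is one way to get the consistent capitalization and verb tenses listed among the design goals.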

Mapping PMD 2.0 to PREMIS

Observations

• Large-scale automated events feel very different from human events

• ITHAKA archive will quadruple in 2010

– Likely 3-5 billion events . . .

• Every bit of metadata has to be justified by need

• Events have proved their value

– An entire talk on that subject alone

• Nothing is easy in quantities of billions

• We still have to work on full lifecycle events

• THIS IS STILL A WORK IN PROGRESS!
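A quick back-of-the-envelope check of the growth estimate, using the October 2009 figures quoted earlier. It assumes the event count scales linearly with archive size, which is a simplification made here and not a claim from the talk.

```python
# Figures quoted for the Portico archive as of October 2009 (approximate).
events_2009 = 1_000_000_000   # "~1 Billion Events"
files_2009 = 150_000_000      # "~150 Million Files"

# "ITHAKA archive will quadruple in 2010"
growth_factor = 4

# Average events per file under the 2009 figures.
events_per_file = events_2009 / files_2009

# Assumption: events grow in proportion to the archive.
projected_events = events_2009 * growth_factor

print(f"events per file: {events_per_file:.1f}")
print(f"projected events: {projected_events:,}")
```

Under that assumption the projection is about 4 billion events, which falls inside the 3-5 billion range given above.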
