Digital Preservation Dale Flecker Stephen Abrams February 15, 2007 HUL University Library Council

Digital Preservation

Dale Flecker, Stephen Abrams

February 15, 2007

HUL University Library Council

Agenda

I The problem

II What has Harvard been doing?

III What more do we need to do?

I The problem …

… is twofold

• Keeping the bits

• Keeping the bits useful

Keeping the bits

• Digital things are amazingly easy to destroy!

– Bad guys want to do damage

– Hardware/software fails

– People make mistakes

• The slip of a finger, or an unnoticed consequence of change, happens easily and is potentially catastrophic

Destruction is not always apparent

Data not used regularly is always at risk of unintended and unnoticed damage. (Note that archival copies can be pretty invisible…)

Keeping bits useful

Digital materials are fragile!!!

They depend on technologies for their vitality… and those technologies age and disappear rapidly.

Fragility

• Using digital content requires mediation by hardware and software

• Hardware and software must understand the format of the content

• Hardware and software technology change continually

Fragility

• Old technology will break

• New technology frequently does not understand old formats

II What has Harvard been doing?

Internally …

Digital Repository Service (DRS)

• Secure, professionally managed environment

– Manage data rigorously, with discipline, and in accordance with community best practices

• Redundant, heterogeneous, distributed storage with periodic media migration…

Digital Repository Service (DRS)

• Know what data you have

– What are the logical objects (“works”, not files)?

– What are the technical characteristics of those objects?

• Check the data continuously

• Manage access to stored objects
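One common way to "check the data continuously" is a fixity audit: record a digest for each file at ingest, then periodically recompute and compare. The following is a minimal sketch of that idea, not a description of how the DRS does it. The choice of SHA-256 and the names `sha256_of`, `build_manifest`, and `audit` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest by streaming the file in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a digest for every file directly under root (done once, at ingest)."""
    return {p.name: sha256_of(p) for p in sorted(root.iterdir()) if p.is_file()}


def audit(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return names of files that are missing or whose digest no longer matches."""
    damaged = []
    for name, recorded in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != recorded:
            damaged.append(name)
    return damaged
```

Run on a schedule, an audit like this surfaces silent damage to files that nobody is actively using; a non-empty result would then be repaired from one of the redundant copies.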

Format

• Understanding formats is fundamental to preservation

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...

SOI, APP0 (JFIF 1.2), APP13 (IPTC), APP2 (ICC), DQT, SOF0 (183x512), DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
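The hex dump above only becomes meaningful once the format is known. As a minimal sketch, assuming just the basic JPEG marker layout (a 0xFF byte, a marker code, then a two-byte big-endian length that counts itself), the first bytes of the dump can be decoded into named segments. The function names and the `SAMPLE` constant are hypothetical; the bytes are the first 38 bytes of the dump, so the APP13 payload is truncated.

```python
import struct

# A few marker codes from the JPEG specification; unnamed codes are shown in hex.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13", 0xDB: "DQT",
    0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS", 0xD9: "EOI",
}

# First 38 bytes of the hex dump shown on the slide.
SAMPLE = bytes.fromhex(
    "ffd8ffe000104a46494600010201008300830000"
    "ffed0fb050686f746f73686f7020332e3000"
)


def list_segments(data: bytes):
    """Return (name, declared_length, available_payload) for each marker segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = [("SOI", 0, b"")]
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        code = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]  # may be truncated
        segments.append((MARKER_NAMES.get(code, f"0x{code:02X}"), length, payload))
        if code == 0xDA:  # entropy-coded image data follows; stop walking
            break
        pos += 2 + length
    return segments


def jfif_version(payload: bytes):
    """Return 'major.minor' if the payload is a JFIF APP0 header, else None."""
    if payload[:5] != b"JFIF\x00" or len(payload) < 7:
        return None
    return f"{payload[5]}.{payload[6]}"


segments = list_segments(SAMPLE)
print([name for name, _, _ in segments])  # ['SOI', 'APP0', 'APP13']
print(jfif_version(segments[1][2]))       # 1.2
print(segments[2][2])                     # b'Photoshop 3.0\x00'
```

Without knowledge of the marker structure, the same bytes are an opaque string; with it, the file identifies its own format version and the software that wrote it.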

Format

• Formats vary significantly in their “preservability”

• Keeping multiple versions of a given piece of content for different purposes is frequently wise

– E.g. archival master, production master, use copy

Format

• Some criteria for “preservability” (from LC)

– Disclosure (how well documented?)

– Adoption (how widely used?)

– Transparency (is compression used?)

– Self documenting (good!)

– External dependencies (self sufficiency is good)

– Patents (could limit preservation actions)

– DRM/encryption (what if decryption key is not available?)

Metadata

• The basis of decision-making for preservation

– Technical metadata

• What format is this in?

• What format options are used?

– Structural metadata

• If I change this, what else is affected?

Metadata

– Administrative metadata

• Who has the right to make decisions about this?

– Relationship metadata

• Are there other versions of this object?

– How do these affect my preservation strategy?

– Provenance metadata

• Where did this come from?

• What changes has it already undergone?

Guidelines for “preservable” objects

The least expensive, and most effective, preservation measure is to think about the future when an object is created!

(Guidelines on format, metadata, archival masters, etc.)

JHOVE (JSTOR/Harvard Object Validation Environment)

A widely used tool for format identification, validation, and characterization.

JHOVE (JSTOR/Harvard Object Validation Environment)

When an object is ingested:

• Determine its format (“identify”)

• Ensure that it is properly formed (“validate”)

• Extract meaningful technical metadata (“characterize”)
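The three ingest steps can be illustrated with a toy example for one format. This is a sketch of the general idea only: it does not use or imitate JHOVE's actual interfaces, the validation is deliberately partial (it checks just the PNG signature and the first chunk), and all function names, including the `make_sample_header` helper, are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def identify(data: bytes) -> str:
    """Identify: guess the format from its leading 'magic' bytes."""
    if data.startswith(PNG_SIGNATURE):
        return "PNG"
    if data.startswith(b"\xff\xd8"):
        return "JPEG"
    if data.startswith(b"%PDF-"):
        return "PDF"
    return "unknown"


def read_first_chunk(data: bytes):
    """Return (chunk_type, payload, stored_crc) for the first PNG chunk."""
    offset = len(PNG_SIGNATURE)
    (length,) = struct.unpack(">I", data[offset:offset + 4])
    chunk_type = data[offset + 4:offset + 8]
    payload = data[offset + 8:offset + 8 + length]
    (stored_crc,) = struct.unpack(">I", data[offset + 8 + length:offset + 12 + length])
    return chunk_type, payload, stored_crc


def validate(data: bytes) -> bool:
    """Validate (partially): first chunk is a 13-byte IHDR with a matching CRC."""
    if identify(data) != "PNG":
        return False
    try:
        chunk_type, payload, stored_crc = read_first_chunk(data)
    except struct.error:  # file too short to hold the chunk
        return False
    return (chunk_type == b"IHDR" and len(payload) == 13
            and zlib.crc32(chunk_type + payload) == stored_crc)


def characterize(data: bytes) -> dict:
    """Characterize: extract technical metadata from the IHDR chunk."""
    _, payload, _ = read_first_chunk(data)
    width, height, bit_depth, color_type = struct.unpack(">IIBB", payload[:10])
    return {"format": "PNG", "width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}


def make_sample_header(width: int, height: int) -> bytes:
    """Build a PNG signature plus IHDR chunk (8-bit RGB) for demonstration."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + ihdr)
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR" + ihdr + struct.pack(">I", crc)
```

The ordering matters: characterization trusts the structure that validation has checked, and validation needs to know which format's rules to apply.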

DRS: what’s managed today

As of January 2007, 5.6M files and 22 TB, excluding Google and web archiving

II What has Harvard been doing?

Externally…

E-journal archiving

• “How can we ensure that licensed e-journal content will remain usable over time?”

• Mellon-funded study

• Explored technical formats, content types, transactions and dataflows, validation, systems requirements, contractual requirements, business models

• Harvard’s proposed model largely implemented by Portico

Technical Metadata for Digital Still Images

• “What are the appropriate technical metadata necessary for the preservation of images?”

• Standardized as NISO Z39.87

• Expressed in the MIX schema

– Maintained by LC

• The basis for DRS image technical metadata

METS (Metadata Encoding and Transmission Standard)

• “Is there a generic packaging form for digital content?”

• For example,

– Digital books

– Audio works

– Images (archival master, production master, deliverables)

• Useful for exchange of objects between repositories

• Maintained by LC
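As a rough illustration of METS as a packaging form, the sketch below builds a skeletal METS document that groups the files of one object by use (archival master, production master, deliverable) and ties them together in a structural map. It uses only a few core METS elements and is not claimed to be schema-valid or to reflect the DRS profile; the `build_mets` function, the file IDs, and the paths in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(label, files):
    """Build a skeletal METS tree.

    files: list of (file_id, use, href) tuples, one per stored file.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": label})

    groups = {}  # one fileGrp per distinct use
    for file_id, use, href in files:
        if use not in groups:
            groups[use] = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": use})
        file_el = ET.SubElement(groups[use], f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # The structural map points back at each file by its ID.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


root = build_mets("Sample image", [
    ("F1", "archival master", "masters/img0001.tif"),
    ("F2", "production master", "production/img0001.tif"),
    ("F3", "deliverable", "delivery/img0001.jpg"),
])
print(ET.tostring(root, encoding="unicode"))
```

Because the package names the logical object and lists every file with its role, a receiving repository can tell which file is the master and which are derivatives without knowing the sender's storage layout.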

Core audio metadata

• “What are the appropriate technical metadata necessary for the preservation of audio?”

• Standardized as AES X-098

• Used as the basis for DRS audio technical metadata

PDF/A

• “PDF defines too many options; is there a ‘flavor’ that will be more ‘preservable’ over time?”

• Requires, recommends, and restricts PDF functionality to enhance preservability

• Standardized as ISO 19005

PREMIS (PREservation Metadata: Implementation Strategies)

• “What are the general metadata elements necessary to preserve digital content over time?”

• OCLC/RLG-sponsored work group

• Recommendations and best practices for preservation metadata

– Core elements, data dictionary, implementation strategies, cooperative projects

PREMIS (PREservation Metadata: Implementation Strategies)

• Report on current practices and recommended metadata elements available

• Maintained by LC

AIHT (Archive Ingest and Handling Test)

• “What difficulties can we expect to arise during the exchange of content between heterogeneous repositories?”

• LC-funded project to investigate exchange of complex data between preservation repositories

• Harvard, Stanford, Johns Hopkins, Old Dominion ingest and exchange web archive data

GDFR (Global Digital Format Registry)

• “What will we need to know in the future about formats in use today, and how will we know it?”

• Shared registry of preservation-related information about technical formats

• Reduce work for repositories to create and maintain information about objects they ingest…

GDFR (Global Digital Format Registry)

• Enables sharing of format expertise

• Directed by Harvard, implemented by OCLC

• Funded by Mellon Foundation

Registry of Digital Masters

• “How can I find out who has accepted archival responsibility for a given piece of content?”

• Initially reformatted materials; intention to expand to born-digital

• DLF project

• Implemented by and housed at OCLC

Repository certification

• “Why should a collection manager trust a digital repository?”

• RLG/OCLC report on Trusted Repository Attributes

• RLG/NARA Digital Repository Certification Task Force…

Repository certification

• Recommend structure and metrics of an international process for certifying preservation repositories

– Organizational role and structure, staff size and skill, formal operations and documentation, appropriate technical infrastructure and facilities, on-going funding, and “hand-off” plan, etc.

• CRL Auditing and Certification project

Key activities elsewhere

• ISO 14721 OAIS (Open Archival Information System)

• LC NDIIPP (National Digital Information Infrastructure Preservation Program)

• Web archiving (IA, IIPC)

• NARA ERA (Electronic Records Archiving)

• Digital Curation Centre

• PLANETS

III What more do we need to do?

Evolution: from projects to program

• Digital preservation requires a continual, proactive program

– You can’t just stop and start

– Time frames are MUCH shorter than for preservation of physical collections

• Need to define scope and role of our preservation efforts

• Investment required in both technology and staffing

Preservation lifecycle

• Creation

– Format and technical specification choices

– Accompanying metadata

– Packaging for ingest

• Ingest

– Validation

– Normalization

Preservation lifecycle

• Assumption of preservation responsibility

• Monitoring

– When is intervention necessary?

• Changes to the technical environment

• Changes to user expectations

• Planning

– Significant properties

• All preservation decisions involve choice; how to choose what to preserve?

Preservation lifecycle

• Intervention (preserving usability)

– Re-acquisition

– Re-generation from an archival master

– Migration before necessary (“just in case”)

– Migration at point of request (“just in time”)

– Emulation of obsolete technology in contemporary environment

– Universal Virtual Computer (UVC)

• Rewrite necessary software to run on technology-agnostic “virtual” computer
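The difference between the two migration strategies listed above is essentially eager versus lazy conversion. The following is a minimal sketch under stated assumptions: the `DerivativeStore` class and its methods are hypothetical, objects are plain byte strings, and `convert` stands in for whatever format migration would actually be applied.

```python
from typing import Callable, Dict


class DerivativeStore:
    """Keeps archival masters and derives migrated copies from them."""

    def __init__(self, convert: Callable[[bytes], bytes]):
        self._convert = convert
        self._masters: Dict[str, bytes] = {}
        self._migrated: Dict[str, bytes] = {}
        self.conversions = 0  # counts how much migration work was done

    def ingest(self, object_id: str, master: bytes) -> None:
        self._masters[object_id] = master

    def _derive(self, object_id: str) -> None:
        self.conversions += 1
        self._migrated[object_id] = self._convert(self._masters[object_id])

    def migrate_all(self) -> None:
        """'Just in case': convert every master now, whether or not it is requested."""
        for object_id in self._masters:
            if object_id not in self._migrated:
                self._derive(object_id)

    def request(self, object_id: str) -> bytes:
        """'Just in time': convert on first request, then reuse the result."""
        if object_id not in self._migrated:
            self._derive(object_id)
        return self._migrated[object_id]


store = DerivativeStore(bytes.upper)  # bytes.upper is a stand-in "migration"
store.ingest("a", b"first object")
store.ingest("b", b"second object")
store.request("a")        # only "a" is converted
print(store.conversions)  # 1
store.migrate_all()       # converts the remaining object, "b"
print(store.conversions)  # 2
```

Just-in-time migration defers cost and skips objects nobody asks for, but it depends on the conversion tool still working at request time; just-in-case migration pays up front while the old format is still well understood. In both cases the master is retained, so a derivative can be re-generated later.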

Preservation lifecycle

• Intervention (continued)

– Save for digital archeologists

• After intervention

– Post-intervention quality assurance

– Documenting the process of change

• Succession planning

– What do we do when we want to get out of the repository business?

Staffing and responsibilities

• Technical

– Infrastructure maintenance

– Monitor technological change

– Integration into larger preservation environment

– Preservation planning

• Curatorial

– Preservation intervention will involve trade-offs

• What attributes need to be preserved?

• Cost/benefit analysis

Immediate challenges

• Google

– Substantial increase in scale (both number and size)

– “Dark” content; no expectation of current access

• Web archiving

– Explosion of data types

– No forethought on format selection and technical specifications

– No metadata

– Some failure may be inevitable

Coming soon?

• Institutional repository (IR) to enhance scholarly communication and preserve scholarly creations

– Similar to web archiving: objects not typically created with preservation in mind, nor accompanied by metadata

• “Just in case” local copies of licensed content

– May necessitate increased sophistication of IPR management

Longer term issues

• Economics – What can we afford to preserve?

• Scale – How much can we preserve?

• Selection – What do we leave for others?

• Federation – Can we share responsibilities for preservation?

– Copies in independent environments are safest

• Certification – Do we need formal certification?

– Note Section 108 revision

• Education – Who at Harvard needs to understand?