Requirements for Long-Term Preservation
David Giaretta
1st October 2009, Helsinki
Digital Preservation…
Easy to do… …as long as you can provide money forever
Easy to test claims about repositories… …as long as you live a long time
Digital Preservation activities
• Infrastructure
• Information about users and practices
• ISO standard: OAIS
• ISO standard: OAIS update
• ISO standards: Audit and Certification
• Tools
• Relationship to related work and community practices
Alliance for Permanent Access
• The Alliance aims to develop a shared vision and framework for a sustainable organisational infrastructure for permanent access to scientific information
The British Library
European Organization for Nuclear Research [CERN]
CSC - IT Center for Science
Delegation of the Finnish Academies of Science and Letters
Deutsche Nationalbibliothek
Digital Preservation Coalition
European Science Foundation [ESF]
European Space Agency [ESA]
Helmholtz-Gemeinschaft Deutscher Forschungszentren
International Association of Scientific, Technical & Medical Publishers
Joint Information Systems Committee [JISC]
Koninklijke Bibliotheek
Max-Planck-Gesellschaft
NESTOR Kompetenznetzwerk
Nationale Coalitie Digitale Duurzaamheid [NCDD]
Portico
Science & Technology Facilities Council [STFC]
http://www.alliancepermanentaccess.org/
PARSE.Insight
Preservation is a Social activity
Sometimes activities are personal: “preserve for your future self” [Australia]
In the short term for re-use by colleagues and other people
In the long term for re-use by future generations
Neeri 2009, 1-2 Oct 2009, Helsinki
Definitions (OAIS)
Long Term Preservation: The act of maintaining information, Independently Understandable by a Designated Community, and with evidence supporting its Authenticity, over the Long Term.
Long Term: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.
Not just BIT preservation
Not just rendering
Information not just DATA or Documents
Authenticity
Things change/disappear
• Software
• Hardware
• Environment, e.g. network links to related information
• People
• What is “common knowledge”
How can we ensure that the information trapped in the “bits” remains understandable despite all these changes?
Just Format?
sfqsftfoubujpo jogpsnbujpo svmft
representation information rules
You have a file
JHOVE tells you it is WORD version 7
Format – necessary but not sufficient:
formats can be used for multiple purposes e.g. audio files used to store configuration parameters
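The scrambled text on this slide illustrates the point: knowing that something is plain text does not make it readable until the encoding rule is also known. As a minimal sketch, assuming the rule is that every letter was shifted forward by one place in the alphabet (an inference from the slide, not something it states), the hypothetical helper `shift_letters` undoes the shift:

```python
def shift_letters(text: str, shift: int) -> str:
    """Shift each lowercase ASCII letter by `shift` places, wrapping around."""
    shifted = []
    for ch in text:
        if "a" <= ch <= "z":
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            shifted.append(ch)  # spaces and other characters pass through
    return "".join(shifted)


# The scrambled words copied from the slide.
scrambled = "sfqsftfoubujpo jogpsnbujpo svmft"

# Assumed rule: each letter was moved forward by one, so move it back by one.
decoded = shift_letters(scrambled, -1)
print(decoded)
```

The shift of one is itself a piece of Representation Information: without it, the format label "text" says nothing about how to recover the meaning.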
XML enough?
<family> <father>John</father> <mother>Mary</mother> <son>Paul</son></family>
<VOTABLE version="1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1" xmlns="http://www.ivoa.net/xml/VOTable/v1.1"><RESOURCE><TABLE name="6dfgs_E7_subset" nrows="875"><PARAM arraysize="*" datatype="char" name="Original Source"
value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz"><DESCRIPTION>URL of data file used to create this table.</DESCRIPTION></PARAM><PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT
demo usage."/><FIELD arraysize="15" datatype="char" name="TARGET"><DESCRIPTION>Target name</DESCRIPTION></FIELD><FIELD arraysize="11" datatype="char" name="DEC" unit="DMS"><DATA><FITS><STREAM encoding='base64'>U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBmb3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAgICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAvIE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
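A short sketch of why XML alone may not be enough. Using only the Python standard library, a parser recovers the structure of the family example, but nothing in the result says what "father" or "son" mean. The second part decodes the first twelve characters of the base64 STREAM copied from the VOTable excerpt above; the variable names are illustrative assumptions.

```python
import base64
import xml.etree.ElementTree as ET

# The family example from the slide: well-formed XML.
family_xml = (
    "<family><father>John</father><mother>Mary</mother>"
    "<son>Paul</son></family>"
)
root = ET.fromstring(family_xml)

# Parsing yields tag/text pairs, i.e. structure only. The meaning of each
# tag still has to come from somewhere else.
pairs = [(child.tag, child.text) for child in root]
print(pairs)

# First 12 characters of the base64 stream in the VOTable excerpt.
stream_prefix = "U0lNUExFICA9"
decoded_prefix = base64.b64decode(stream_prefix)
print(decoded_prefix)
```

The decoded bytes are the start of a FITS header, so interpreting the XML file also requires the FITS standard, even though the file validates as XML.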
Data…
Level 2 GOME Satellite instrument data
Complex container objects
Key OAIS Concepts
• Claiming “This is being preserved” is untestable
– Essentially meaningless
– Except “BIT PRESERVATION”
• How can we make it testable?
– Claim to be able to continue to “do something” with it
– Understand/use
– Need Representation Information
• Still meaningless…
– Things are too interrelated
– Representation Information potentially unlimited
– Designated Community
• Many other concepts identified
– Finer grained taxonomy than simply saying “Metadata”
– Allows one to ask if one has all the required types
Available from: http://public.ccsds.org/publications/archive/650x0b1.pdf
Representation Information
The Information Model is key
Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY
(this knowledge will change over time and region)
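The recursion can be sketched as a walk over a Representation Information network that stops wherever the Designated Community already has the knowledge. This is only an illustration: the network entries, the knowledge base contents and the function name `rep_info_to_collect` are assumptions, not part of OAIS.

```python
# Hypothetical network: each item maps to the Representation Information
# needed to interpret it. The entries are illustrative only.
REP_INFO_NETWORK = {
    "data file": ["format description"],
    "format description": ["dictionary", "English language"],
    "dictionary": ["XML specification"],
    "XML specification": ["Unicode specification"],
    "Unicode specification": [],
    "English language": [],
}


def rep_info_to_collect(item, knowledge_base, network):
    """List the Representation Information an archive must hold for `item`.

    The walk follows the network recursively and ends at anything already in
    the Designated Community's knowledge base.
    """
    collected = []

    def visit(node):
        for dependency in network.get(node, []):
            if dependency in knowledge_base or dependency in collected:
                continue  # recursion ends here
            collected.append(dependency)
            visit(dependency)

    visit(item)
    return collected


# A community that already knows English and XML needs little extra.
today = rep_info_to_collect(
    "data file", {"English language", "XML specification"}, REP_INFO_NETWORK
)
# A community with no relevant knowledge needs the whole network.
nothing_known = rep_info_to_collect("data file", set(), REP_INFO_NETWORK)
print(today)
print(nothing_known)
```

Because the knowledge base changes over time and between regions, the same data file can need different amounts of Representation Information at different moments.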
OAIS Archival Information Package (AIP)
[Diagram: Archival Information Package. Recoverable labels: Content (further described by); Package; Packaging; derived from; described by; delimited by; Data Object; Physical Object; Digital Object; Bit; Structure; Reference; Other; interpreted using; adds meaning to; Provenance; Context; Fixity; Access Rights; multiplicities 1 and 1...*]
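One way to picture the package is as a small data structure. The sketch below is not the OAIS information model itself: the class and field names are illustrative, and using a SHA-256 digest for Fixity is an assumption made only for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class ContentInformation:
    data_object: bytes                     # the bits being preserved
    representation_information: List[str]  # what is needed to interpret them


@dataclass
class PreservationDescriptionInformation:
    reference: str
    provenance: List[str]
    context: str
    fixity_sha256: str  # assumption: Fixity recorded as a SHA-256 hex digest
    access_rights: str


@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation
    packaging: str  # how the pieces are delimited, e.g. a container file

    def fixity_ok(self) -> bool:
        """Check that the stored bits still match the recorded digest."""
        digest = hashlib.sha256(self.content.data_object).hexdigest()
        return digest == self.pdi.fixity_sha256


bits = b"example data object"
aip = ArchivalInformationPackage(
    content=ContentInformation(bits, ["format description", "data dictionary"]),
    pdi=PreservationDescriptionInformation(
        reference="example-id-0001",
        provenance=["created by instrument", "ingested by archive"],
        context="part of an example collection",
        fixity_sha256=hashlib.sha256(bits).hexdigest(),
        access_rights="open",
    ),
    packaging="single ZIP file",
)
print(aip.fixity_ok())
```

A passing fixity check only shows that the bits are unchanged; whether they remain understandable depends on the Representation Information held alongside them.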
Representation Information Network
Preservation and Re-use: Unfamiliar information
• Preservation
– Digitally encoded information which must be usable and understandable
– Unfamiliar because of separation in time
• E-Science/GRID/CyberInfrastructure for data
– Digitally encoded information which must be usable and understandable
– Unfamiliar because of separation in discipline or location – even if created yesterday
• Support automated usage where possible
[Diagram labels: Rep Info; /DISCIPLINE; Virtualisation]
Insight: stakeholders
Research
• Research institutes (non-profit)
• Universities
• Academic libraries
Data management (preservation)
• Data centres (profit / non-profit)
• Libraries
• Archives
Funding/policy
• National funding organisations
• European funding
• Corporate funding
Publishing
• General (cross-community) publishers
• Specific (community) publishers
Surveys to stakeholders
• Research: Elsevier mailing list (35,000 people), ESF, MCFA, Eurodoc, ALLEA, YEAR, Digital Humanities Observatory, etc.
• Data management (preservation): LIBER, DPE, DPC, NCDD, DCC, D-lib Magazine, PADI, JISC mailing lists, CASPAR, Planets, etc.
• Funding/policy: ESF, Alliance for Permanent Access, national funding agencies
• Publishing: International Association of STM publishers, Directory of Open Access Journals (DOAJ)
Surveys to stakeholders
• Research: 1397 responses
• Data management (preservation): 273 responses
• Funding/policy: < responses
• Publishing: 186 responses
Threats to preservation
1. Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.
2. Lack of sustainable hardware, software or support of computer environment may make the information inaccessible.
3. Evidence may be lost because the origin and authenticity of the data may be uncertain.
4. Access and use restrictions (e.g. Digital Rights Management) may not be respected in the future.
5. Loss of ability to identify the location of data.
6. The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future.
7. The ones we trust to look after the digital holdings may let us down.
Threats to preservation (R)
The ones we trust to look after the digital holdings may let us down
The current custodian of the data may cease to exist
Loss of ability to identify the location of data
Access and use restrictions may not be respected in the future
Evidence may be lost
Lack of sustainable hardware/software
Users may be unable to understand or use the data
Threat / Requirement for solution
• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
– Ability to create and maintain adequate Representation Information
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
– Ability to share information about the availability of hardware and software and their replacements/substitutes
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
– Ability to bring together evidence from diverse sources about the Authenticity of a digital object
• Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
– Ability to deal with Digital Rights correctly in a changing and evolving environment
• Loss of ability to identify the location of data
– An ID resolver which is really persistent
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
– Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation
• The ones we trust to look after the digital holdings may let us down
– Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term
FUTURE
• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
• Access and use restrictions may not be respected in the future
• Loss of ability to identify the location of data
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
• The ones we trust to look after the digital holdings may let us down
Roadmap
PARSE.Insight produced draft Preservation Infrastructure Roadmap
Now a SCIENCE DATA INFRASTRUCTURE ROADMAP after consultation with EU
Infrastructures for preservation
• Social / Legal / Financial / Organisational
– Agreements / Trust / Standards
– Costs / Benefits / Rewards
• Technical components
Lessons from other Infrastructures
Need to “grow”, “encourage”, “foster” rather than “build”
include organisational, financial, legal & marketing
Provide services rather than specific technologies
Tackle “choke points”
Various phases of development
Encouraging Organisational and Social change
• Policies: mandates for depositing research data and funding agencies' requirements.
• Robust and reliable deposit places, where researchers can be sure their data will not be lost, corrupted or misused, with correct access rights mechanisms.
• Elements that increase comfort levels so that new users will know how to use and interpret the available data.
• Communication and awareness around these issues.
• Have publication of data as valued and as referenceable as publication of a paper in a journal.
Repository Audit and Certification
• Standard for certification in OAIS Roadmap
• Initial work produced TRAC
• Now an official CCSDS Working Group
• Open virtual meetings, notes and documents: http://www.digitalrepositoryauditandcertification.org
• Draft standard submitted to CCSDS/ISO to form the basis of an international audit and certification process
CASPAR Consortium
http://www.casparpreserves.eu
EU FP6 Integrated Project
Total spend approx. 16MEuro (8.8 MEuro from EU)
Started April 2006, for 42 months
http://developers.casparpreserves.eu:8080
Preservation Data Flows and Strategies
More strategies than just “emulate or transform”
Creating an OAIS Archival Information Package
Modules and Dependencies: defining the Designated Community
[Diagram: dependency graph of modules. Recoverable labels: README.txt; TEXT EDITOR; ENGLISH LANGUAGE; WINDOWS XP; FITS FILE; FITS STANDARD; PDF STANDARD; FITS JAVA s/w; JAVA VM; PDF s/w; FITS DICTIONARY; DICTIONARY SPECIFICATION; UNICODE SPECIFICATION; XML SPECIFICATION; MULTIMEDIA PERFORMANCE DATA; C3D; DirectX; MAX/MSP; 3D motion data files; 3D scene data files; motion to music mapping strategy]
Modules and Dependencies: Examples (Semantic Web data)
[Diagram: modules and dependencies. Labels: ns1; ns2; ns3; ns4; RDF/S]
Scenario: Intelligibility-aware Packaging
[Diagram labels: o1, o2, o3; P1, P2, P3; FITS; FITS STANDARD; PDF STANDARD; FITS DICTIONARY; DICTIONARY SPECIFICATION; UNICODE SPECIFICATION; XML SPECIFICATION; C3D; DirectX; MAX/MSP; ZIP]
• Gap(o2,P1) =
• Gap(o2,P2) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION}
• Gap(o2,P3) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION, PDF_STANDARD, XML_SPECIFICATION, UNICODE_SPECIFICATION}
• Gap(o3,P3) = {ZIP}
• Gap(o3, ) = {ZIP, C3D, DirectX, MAX/MSP}
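The gaps listed on this slide can be reproduced as a set difference: everything an object depends on, minus everything a community profile already knows. The sketch below is a hedged illustration. The dependency edges and the contents of P2 and P3 are assumptions chosen to be consistent with the gaps shown; the slide does not spell them out. Reading "Gap(o3, )" as a gap against an empty profile is also an assumption.

```python
# Assumed dependency graph: each module maps to the modules it directly needs.
DEPENDS_ON = {
    "o2": {"FITS"},
    "FITS": {"FITS_STANDARD", "FITS_DICTIONARY"},
    "FITS_STANDARD": {"PDF_STANDARD"},
    "FITS_DICTIONARY": {"DICTIONARY_SPECIFICATION"},
    "DICTIONARY_SPECIFICATION": {"XML_SPECIFICATION"},
    "XML_SPECIFICATION": {"UNICODE_SPECIFICATION"},
    "o3": {"ZIP", "C3D", "DirectX", "MAX/MSP"},
}


def required_modules(obj):
    """Return every module reachable from `obj` through the dependencies."""
    required, stack = set(), list(DEPENDS_ON.get(obj, ()))
    while stack:
        module = stack.pop()
        if module not in required:
            required.add(module)
            stack.extend(DEPENDS_ON.get(module, ()))
    return required


def gap(obj, profile):
    """Modules needed for `obj` that the profile does not already cover."""
    known = set(profile)
    for module in profile:
        known |= required_modules(module)  # knowing a module covers its needs
    return required_modules(obj) - known


# Assumed profile contents (hypothetical, for illustration only).
P2 = {"PDF_STANDARD", "XML_SPECIFICATION", "UNICODE_SPECIFICATION"}
P3 = {"C3D", "DirectX", "MAX/MSP"}

print(sorted(gap("o2", P2)))
print(sorted(gap("o2", P3)))
print(sorted(gap("o3", P3)))
print(sorted(gap("o3", set())))
```

Intelligibility-aware packaging would then ship an object together with its gap for the intended profile, rather than with every dependency.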
Example: Identification of an Attribution Right
[Diagram: Rights Ontology based on CIDOC-CRM and FRBRoo, relating the Work's Provenance, Legislation and Rights. Node labels: E39. Actor (Kia Ng); Activity of Improvisation on the Violin; Expression of the Improvisation on the Violin; CR20. Perform (Singleton); CR51. Attribution_Right (Singleton); LF1. Written_Norm (Art. X of Law Y); Kia’s right to claim authorship; E72. Legal Object; F22. Self_contained_Expression; E7. Activity (Kia claiming authorship); F28. Expression_Creation; E30. Right; CR. Ownership Right; CR. Activity_Type (To claim authorship). Property labels: has_type; generates; is_documented_in; became_owner_of; is_on; created; carried_out; allows; performed_by; has_right_type; Derived Property. Annotations: “100% recall, <100% precision”; “100% precision”]
Thanks to MetaWare
Provenance: Performing Arts
Thanks to ULeeds and CNRS
Authenticity
Threat / CASPAR Component
• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
– RepInfo toolkit, Packager and Registry, to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
– Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
– Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.
• Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
– Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.
• Loss of ability to identify the location of data
– Persistent Identifier system: such a system will allow objects to be located over time.
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
– Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.
• The ones we trust to look after the digital holdings may let us down
– The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.
Conclusions
• Preservation
– Is a complex process
– Involves more than just bits and formats
– Metadata is too vague a term
• Transparency is vital
– What is being preserved
– For whom
– For how long
• OAIS is a good basis for preservation
• Recursion is an important concept in preservation
• Preservation threats must be countered by specific tools and shared infrastructure components
Additional links
• CASPAR: www.casparpreserves.eu
• PARSE.Insight: www.parse-insight.eu
• Alliance for Permanent Access: www.alliancepermanentaccess.eu
• Digital Curation Centre: www.dcc.ac.uk
• Audit and certification: wiki.digitalrepositoryauditandcertification.org
• OAIS: http://public.ccsds.org/publications/archive/650x0b1.pdf and http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx
END
Summary
• What is digital preservation?
• Transparency
• What is needed for digital preservation?
• Many strategies
– Need to be clear about the scope of each
– Document/rendered object?
– Scientific data – processed/combined to produce new results?
– Other?
– How are all of the threats being addressed?
• What exactly is being preserved?
• For whom is it being preserved?
– Designated Community must be specified
– Testability through understandability/usability
• How will it be handed on to future custodians?
Umbrella framework
• Need to integrate in some sense many different
– Systems
– Disciplines
– Funding
– Requirements
– Projects producing preservation artefacts (Representation Information, Significant Properties, Provenance etc)
About researchers
EU 44%, USA 33%, Other 23%
Per category
Data spectrum (R)
Cross-disciplinary use of research data
Sharing of data (R): Did you ever need digital research data gathered by other researchers that was not available?
Sharing of data (R): Do you presently make use of research data gathered by other researchers?
Sharing of data (R): Would you like to make use of research data gathered by other researchers?
Within discipline / Outside discipline
Sharing of data (R): How open is your data?
Sharing of data (R): Which constraints do you see in making data open?
Sharing of data (R): How do you locate and access digital research data?
Linking of data (R): As a researcher, do you think it is useful to link underlying research to formal literature?
Linking of data (P): Do you link references in your journals to underlying digital research data?
Linking of data (P): Do you as publisher charge separate fees when users want to access data associated with publications?
Linking of data (P): Can authors submit their underlying digital research data with their publication to the publisher?
About funding
Who should pay for data preservation?
• Researchers say: Government (national funding)
• Data managers say: Government (national funding)
• Publishers say: Government (national funding)
Who should pay for preservation of publications?
• Researchers say: Government (national funding)
• Data managers say: Government (national funding)
• Publishers say: Government (national funding)
Who should pay? (P): For preservation of other research output