Auditing Distributed Digital Preservation Networks
Prepared for CNI Fall Meeting 2012, Washington, D.C., December 2012
Micah Altman, Director of Research, MIT Libraries; Non-Resident Senior Fellow, The Brookings Institution
Jonathan Crabtree, Assistant Director of Computing and Archival Research, H.W. Odum Institute for Research in Social Science, UNC
This presentation, delivered at CNI 2012, summarizes lessons learned from trial audits of several production distributed digital preservation networks. The audits were conducted with the open-source SafeArchive system, which automates auditing of a selection of TRAC criteria related to replication and storage. Analysis of the trial audits demonstrates the complexity of auditing modern replicated storage networks and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).
1. Auditing Distributed Digital Preservation Networks. Prepared for CNI Fall Meeting 2012, Washington, D.C., December 2012. Micah Altman, Director of Research, MIT Libraries; Non-Resident Senior Fellow, The Brookings Institution. Jonathan Crabtree, Assistant Director of Computing and Archival Research, H.W. Odum Institute for Research in Social Science, UNC.
2. Collaborators*: Nancy McGovern; Tom Lipkis and the LOCKSS team; Data-PASS partners (ICPSR, Roper Center, NARA, Henry A. Murray Archive); the Dataverse Network team at IQSS. Research support: thanks to the Library of Congress, the National Science Foundation, IMLS, the Sloan Foundation, the Harvard University Library, the Institute for Quantitative Social Science, and the Massachusetts Institute of Technology. (* And co-conspirators.)
3. Related Work (reprints available from micahaltman.com):
Altman, M., & Crabtree, J. (2011). Using the SafeArchive System: TRAC-Based Auditing of LOCKSS. Proceedings of Archiving 2011, Society for Imaging Science and Technology.
Altman, M., Beecher, B., & Crabtree, J. (2009). A Prototype Platform for Policy-Based Archival Replication. Against the Grain, 21(2), 44-47.
4. Preview: Why distributed digital preservation? Why audit? SafeArchive: automating auditing. Theory vs. practice: Round 0, calibration; Round 1, self-audit; Round 2, self-compliance (almost); Round 3, auditing other networks. Lessons learned: practice and theory.
5. Why distributed digital preservation?
6. Slightly Long Answer: Things Go Wrong. Failure sources include: physical and hardware failure; software failure; media failure; insider and external attacks; organizational failure; curatorial error.
7. Potential Nexuses for Preservation Failure (source: Reich & Rosenthal 2005).
- Technical: media failure (storage conditions, media characteristics); format obsolescence; preservation infrastructure software failure; storage infrastructure software failure; storage infrastructure hardware failure.
- External threats to institutions: third-party attacks; institutional funding; change in legal regimes.
- Quis custodiet ipsos custodes? Unintentional curatorial modification; loss of institutional knowledge and skills; intentional curatorial de-accessioning; change in institutional mission.
8. The Problem. "Preservation was once an obscure backroom operation of interest chiefly to conservators and archivists: it is now widely recognized as one of the most important elements of a functional and enduring cyberinfrastructure." [Unsworth et al., 2006] Libraries, archives, and museums hold digital assets they wish to preserve, many of them unique. Many of these assets are not replicated at all. Even when institutions keep multiple backups offsite, many single points of failure remain.
9. Why audit?
10. Short Answer: Why the heck not? "Don't believe in anything you hear, and only half of what you see." - Lou Reed. "Trust, but verify." - Ronald Reagan.
11. Full Answer: It's our responsibility.
12. OAIS Model Responsibilities: accept appropriate information from Information Producers; obtain sufficient control of the information to ensure long-term preservation; determine which groups should become the Designated Community (DC) able to understand the information; ensure that the preserved information is independently understandable to the DC; ensure that the information can be preserved against all reasonable contingencies; ensure that the information can be disseminated as authenticated copies of the original, or as traceable back to the original; make the preserved data available to the DC.
13. OAIS Basic Implied Trust Model. The organization is axiomatically trusted to identify designated communities. The organization is engineered with the goals of collecting appropriate, authentic documents and reliably delivering authentic documents, in understandable form, at a future time. Success depends upon: reliability of storage systems and services (e.g., LOCKSS networks, Amazon Glacier); reliability of organizations (MetaArchive, Data-PASS, Digital Preservation Network); document contents and properties (formats, metadata, semantics, provenance, authenticity).
14. Enhancing Reliability through Trust Engineering.
- Incentives: social engineering; rewards and penalties; recognized practices and shared norms; incentive-compatible mechanisms; social evidence.
- Modeling and analysis: statistical quality control and reliability estimation; threat modeling and vulnerability assessment.
- Regulatory approaches: disclosure; review; certification; audits; regulations and penalties.
- Security engineering: increase effort for the attacker (harden the target to reduce vulnerability; increase technical/procedural controls; remove or conceal targets); increase risk to the attacker (surveillance, detection, likelihood of response); reduce reward (deny benefits, disrupt markets, identify property); reduce provocations; remove excuses.
- Portfolio theory: diversification (financial, legal, technical, institutional); hedging.
- Over-engineering approaches: safety margin; redundancy.
- Informational approaches: transparency (release of information permitting direct evaluation of compliance); common knowledge; crypto (signatures, fingerprints, non-repudiation).
15. Audit [aw-dit]: an independent evaluation of records and activities to assess a system of controls. Fixity mitigates risk only if used for auditing.
16. Functions of Storage Auditing: detect corruption or deletion of content; verify compliance with storage/replication policies; prompt repair actions.
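To make these functions concrete, here is a minimal sketch in Python of a manifest-based fixity audit. It is illustrative only, not SafeArchive or LOCKSS code; the manifest format and the sha256_file helper are assumptions made for the example.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Compute a SHA-256 fixity value for one stored object."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit_collection(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored objects against a trusted manifest {relative_path: digest}.

    Covers the slide's three functions: detect corruption, detect deletion,
    and produce a list that should prompt repair actions (see slide 18).
    """
    report: dict[str, list[str]] = {"corrupted": [], "missing": []}
    for rel_path, expected in manifest.items():
        obj = root / rel_path
        if not obj.exists():
            report["missing"].append(rel_path)    # deletion detected
        elif sha256_file(obj) != expected:
            report["corrupted"].append(rel_path)  # corruption detected
    report["needs_repair"] = report["corrupted"] + report["missing"]
    return report
```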
17. Bit-Level Audit Design Choices. Audit regularity and coverage: on demand (manually); on object access; on event; randomized sample; scheduled/comprehensive. Fixity check and comparison algorithms. Auditing scope: integrity of object; integrity of collection; integrity of network; policy compliance; public/transparent auditing. Trust model. Threat model.
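Of the regularity/coverage options above, randomized sampling trades per-cycle cost for statistical coverage: each cycle reads only a fraction of the archive, yet every object has a nonzero chance of being checked. A hedged sketch; the sampling rule and names are assumptions, not the LOCKSS or SafeArchive schedule:

```python
import random

def pick_audit_sample(object_ids: list[str], fraction: float = 0.05,
                      minimum: int = 10) -> list[str]:
    """Choose a random subset of objects to fixity-check this audit cycle.

    Sampling bounds the per-cycle read cost while giving every object a
    nonzero chance of being checked; a scheduled/comprehensive audit is
    the limiting case fraction=1.0.
    """
    k = min(len(object_ids), max(minimum, int(len(object_ids) * fraction)))
    return random.sample(object_ids, k)
```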
18. Repair. Auditing mitigates risk only if used for repair. Key design elements: repair granularity; repair trust model; repair latency (detection to start of repair); repair duration; repair algorithm.
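One common repair trust model, sketched here purely for illustration (this is not LOCKSS's actual polling protocol), treats the digest that a quorum of replicas agrees on as the valid repair source:

```python
from collections import Counter

def choose_repair_source(digests_by_replica: dict[str, str],
                         quorum: int) -> str | None:
    """Return the digest a quorum of replicas agrees on, or None.

    digests_by_replica maps replica name -> fixity digest for one object.
    Replicas holding the winning digest are valid repair sources; a damaged
    replica re-fetches the object from any of them. Repair latency is the
    gap between this detection and the start of that re-fetch.
    """
    if not digests_by_replica:
        return None
    digest, votes = Counter(digests_by_replica.values()).most_common(1)[0]
    return digest if votes >= quorum else None
```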
19. Summary of Current Automated Preservation Auditing Strategies, by mechanism:
- LOCKSS: automated; decentralized (peer-to-peer); tamper-resistant auditing and repair; for collection integrity.
- iRODS: automated, centralized/federated auditing for collection integrity; micro-policies.
- DuraCloud: automated, centralized auditing for file integrity. (Manual repair by DuraSpace staff available as a commercial service if using multiple cloud providers.)
- Digital Preservation Network: in development.
- SafeArchive: automated; independent; multi-centered; auditing, repair, and provisioning of existing LOCKSS storage networks; for collection integrity and high-level policy (e.g., TRAC) compliance.
20. LOCKSS Auditing & Repair: decentralized, peer-to-peer, tamper-resistant replication and repair.
- Regularity: scheduled.
- Algorithms: bespoke, peer-reviewed, tamper-resistant.
- Scope: collection integrity; collection repair.
- Trust model: publisher is the canonical source of content; changed content is treated as new; replication peers are untrusted.
- Main threat models: media failure; physical failure; curatorial error; external attack; insider threats; organizational failure.
- Key auditing limitations: correlated software failure; lack of policy auditing and of public/transparent auditing.
21. SafeArchive Auditing & Repair: TRAC-aligned policy auditing as an overlay network.
- Regularity: scheduled; manual.
- Fixity algorithms: relies on the underlying replication system.
- Scope: collection integrity; network integrity; network repair; high-level (e.g., TRAC) policy auditing.
- Trust model: external auditor, with permissions to collect metadata/log information from the replication network; the replication network is untrusted.
- Main threat models: software failure; policy implementation failure (curatorial error; insider threat); organizational failure; media/physical failure through the underlying replication system.
- Key auditing limitations: relies on the underlying replication system (now LOCKSS) for fixity check and repair.
22. SafeArchive: TRAC-Based Auditing & Management of Distributed Digital Preservation. Facilitating collaborative replication and preservation with technology: collaborators declare explicit, non-uniform resource commitments; policy records commitments and storage-network properties; the storage layer provides replication, integrity, freshness, and versioning; the SafeArchive software provides monitoring, auditing, and provisioning; content is harvested through HTTP (LOCKSS) or OAI-PMH. Integration of LOCKSS, the Dataverse Network, and TRAC.
23. SafeArchive: Schematizing Policy and Behavior. Policy: "The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies." Schematization turns that prose into machine-checkable terms; behavior (operationalization) is what the network actually does. (A sketch follows.)
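As an illustration of the schematization step, here is a minimal sketch in Python; the field names and structure are assumptions for this example, not SafeArchive's actual policy schema:

```python
from dataclasses import dataclass

@dataclass
class ReplicationPolicy:
    """Machine-checkable form of: 'identify the number of copies of all
    stored digital objects, and the location of each object and their copies.'"""
    collection: str
    min_copies: int    # required number of verified replicas
    min_regions: int   # required geographic spread of those replicas

def complies(policy: ReplicationPolicy,
             replica_regions: dict[str, str]) -> bool:
    """replica_regions maps replica host -> region for one collection."""
    return (len(replica_regions) >= policy.min_copies
            and len(set(replica_regions.values())) >= policy.min_regions)
```

Under this sketch, the Data-PASS commitment described on slide 30 (at least 3 verified replicas per collection, spread over at least 2 regions) becomes min_copies=3, min_regions=2.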
24. Adding High-Level Policy to LOCKSS. LOCKSS ("Lots of Copies Keep Stuff Safe"): widely used in the library community; a self-contained OSS replication system, low-maintenance and inexpensive; harvests resources via web crawling, OAI-PMH, and database queries; maintains copies through a secure P2P protocol; zero-trust and self-repairing. What does SafeArchive add? Auditing: easily monitor the number of copies of content in the network. Provisioning: ensure sufficient copies and distribution. Collaboration: coordinate across partners and monitor resource commitments; provide restoration guarantees; integrate with the Dataverse Network digital repository.
25. Design Requirements. SafeArchive is a targeted vertical slice of functionality through the policy stack.
- Policy-driven: institutional policy creates formal replication commitments; documents and supports TRAC/ISO policies; allows asymmetric commitments (storage commitments, size of holdings being replicated, distribution of holdings).
- Schema-based auditing is used to: verify collection replication; record storage commitments; document all TRAC criteria; demonstrate policy compliance over time.
- Provides restoration guarantees: to the owning archive; to replication hosts.
- Limited trust: no superuser; partners are trusted to hold the unencrypted content of others (reinforced with legal agreements); at least one system is trusted to read the status of participating systems; at least one system can initiate new harvesting on a participating system; no deletion or modification of objects stored on another system.
26. SafeArchive Components.
27. SafeArchive in Action: safearchive.org.
28. Theory vs. Practice. Round 0: Setting up the Data-PASS PLN. "Looks OK to me." - PHB motto.
29. THEORY (Round 0): Start -> install LOCKSS (on 7 servers) -> expose content (through OAI+DDI+HTTP) -> harvest content (through the OAI plugin) -> set up PLN configurations -> LOCKSS magic -> done.
30. Application: Data-PASS Partnership. Data-PASS partners collaborate to identify and promote good archival practices, seek out at-risk research data, build preservation infrastructure, and mutually safeguard collections. Data-PASS collections: 5 collections, updated roughly daily; research data as content; 25,000+ studies; 600,000+ files. Policy: >=3 verified replicas per collection, >=2 regions.
31. Practice (Round 0), annotating the Round 0 flow from slide 29:
- OAI plugin extensions required for: non-DC metadata; large metadata; an alternate authentication method; support for OAI sets; non-fatal error handling.
- OAI provider (Dataverse) tuning: performance handling for delivery; performance handling for errors.
- PLN configuration required: stabilization around LOCKSS versions; coordination around the plugin repository; coordination around collection definitions.
- Dataverse Network extensions: generate LOCKSS manifest pages; license harmonization; LOCKSS export control by the archive curator.
33. Lesson 0: When innovating, plan for a substantial gap between prototype and production, and for multiple iterations.
34. Theory vs. Practice. Round 1: Self-Audit. "A mere matter of implementation." - PHB motto.
35. THEORY (Round 1): Start -> gather information from each replica (via the LOCKSS cache manager) -> integrate information -> map network state -> compare current state to policy. If the network matches policy: success. If not: add a replica and re-audit; log errors for later investigation. (A sketch of this loop follows.)
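The loop on this slide can be sketched in a few lines. Every name below is a hypothetical stand-in (fetch_state, provision_replica); as slide 37 shows, the real implementation had to replace the LOCKSS cache manager and add several heuristics:

```python
def audit_round(replicas, policy, fetch_state, provision_replica, log):
    """One pass of the slide's loop: gather, integrate, compare, act.

    fetch_state(replica) -> iterable of collections that replica holds;
    provision_replica(collection) -> asks another host to start replicating.
    """
    network_state = {}  # collection -> set of replicas holding it
    for replica in replicas:
        try:
            for collection in fetch_state(replica):
                network_state.setdefault(collection, set()).add(replica)
        except IOError as err:
            log(f"could not reach {replica}: {err}")  # investigate later

    compliant = True
    for collection, required_copies in policy.items():
        if len(network_state.get(collection, set())) < required_copies:
            compliant = False
            provision_replica(collection)  # add a replica, then re-audit
    return compliant
```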
36. Implementation: www.safearchive.org
37. Practice (Round 1), annotating the Round 1 flow from slide 35:
- Gathering information required: replacing the LOCKSS cache manager; permissions; reverse-engineering UIs (with help); network magic.
- Integrating information into a map of network state required: heuristics for lagged information; heuristics for incomplete information; heuristics for aggregated information.
- Comparing the map to policy required: a "mere matter of implementation" (in theory).
38. Results (Round 1).
- Outcomes: implementation of the SafeArchive reporting engine; a stand-alone OSS replacement for the LOCKSS cache manager; an initial audit of Data-PASS replicated collections.
- Problems: collections achieving policy compliance were actually incomplete ("Dude, where's our metadata?"); most collections failed policy compliance; adding replicas didn't solve it.
39. Lesson 1: Replication agreement does not prove collection integrity. What you see: replicas X, Y, Z agree on collection A. What you are tempted to conclude: collection A is good.
40. What can you infer from replication agreement? To conclude from "replicas X, Y, Z agree on collection A" that collection A is good, you must assume: harvesting did not report errors, AND either the harvesting system is error-free, OR errors are independent per object AND there is a large number of objects in the collection. Supporting external evidence: multiple independent harvester implementations; systematic harvester testing; collection comparison with an external collection; automated harvester log monitoring; automated restore and systematic comparison testing; per-collection statistics.
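The independence assumption does the heavy lifting in this inference, and a back-of-the-envelope illustration with hypothetical numbers shows why:

```python
# Hypothetical numbers, for illustration only.
p = 0.01      # assumed per-replica, per-object harvest/storage error rate
replicas = 3

# Independent errors: for agreement to mask an error, every replica must
# fail on the same object (and in the same way, which this bound ignores).
p_masked_independent = p ** replicas   # 1e-06

# Fully correlated errors (e.g., all replicas share one buggy harvester):
# a single failure is reproduced everywhere, so agreement proves nothing.
p_masked_correlated = p                # 1e-02, four orders of magnitude larger
print(p_masked_independent, p_masked_correlated)
```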
41. Lesson 2: Replication disagreement does not prove corruption. What you see: replicas X, Y disagree with Z on collection A. What you are tempted to conclude: collection A is bad on host Z; repair/replace collection A on host Z.
42. What can you infer from replication failure? To conclude from "replicas X, Y disagree with Z on collection A" that collection A is bad on host Z, you must assume: disagreement implies that the content of collection A differs across hosts; the contents of collection A should be identical on all hosts; if some content of collection A is bad, the entire collection is bad. Possible alternate scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; ??? (scenarios not yet anticipated).
43. Theory vs. Practice. Round 2: Compliance (almost). "How do you spell backup? RE-COVER-Y."
44. Lesson 3: Distributed digital preservation works, with evidence-based tuning and adjustment (an understand-tune cycle).
- Diagnostics: when the network is out of adjustment, additional information is needed to inform adjustment; worked with the LOCKSS team to gather information; parameterized heuristics and reporting; tracked trends over time.
- Adjustments: timings (e.g., crawls, polls); collections (change partitioning to AUs at the source; extend mapping to AUs in the plugin; extend the reporting/policy framework to group AUs).
- Outcomes: at the time, verified replication of all collections; currently, minor policy violations in one collection; worked with the LOCKSS team to design further instrumentation of LOCKSS.
45. Theory vs. Practice. Round 3: Auditing Other PLNs. "In theory, theory and practice are the same; in practice, they differ."
46. Application: COPPUL (Council of Prairie and Pacific University Libraries). Collections: 9 institutions; dozens of collections; journal runs; digitized member content (text, photos, images, ETDs). Goal: multiple verified replicas.
47. Application: Digital Federal Depository Library Program. The Digital Federal Depository Library Program (the USDocs private LOCKSS network) replicates key aspects of the United States Federal Depository System. Collections: dozens of institutions (24 replicating); electronic publications; 580+ collections; 10 TB, including audio and video content. Testing only; full auditing not yet performed.
49. THEORY (Round 3): Start -> gather information from each replica -> integrate information -> map network state -> compare current state to policy. If the network matches policy: success. If not: have collection sizes and polling intervals already been adjusted? If no, adjust them and re-audit; if yes, add a replica.
50. Here's where things get even more complicated.
51. Practice (Year 3). Lesson 6: Trust, but continuously verify.
- 20-80% initial failure to confirm policy compliance; tuning was infeasible, or yielded only moderate improvement.
- Outcomes: in-depth diagnosis and analysis with the LOCKSS team; adjustment of auditing algorithms to detect "islands of agreement" (see the sketch after this slide); adjusted expectations (focus on inferences rather than raw replication agreement; focus on 100% policy compliance per collection rather than 100% error-free operation); designed file-level diagnostic instrumentation in LOCKSS. Re-analysis in progress.
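"Islands of agreement" can be detected by grouping replicas by the content they report, rather than testing for unanimity. A sketch under assumed inputs (one digest per replica per collection; this is not the instrumented LOCKSS poll output):

```python
from collections import defaultdict

def agreement_islands(digests_by_replica: dict[str, str]) -> list[set[str]]:
    """Group replicas into 'islands' that agree on a collection's content.

    A healthy network shows one large island; several mid-sized islands
    suggest versioning or harvest-timing effects rather than isolated
    corruption, which changes the appropriate response.
    """
    islands: defaultdict[str, set[str]] = defaultdict(set)
    for replica, digest in digests_by_replica.items():
        islands[digest].add(replica)
    return sorted(islands.values(), key=len, reverse=True)
```

For the Lesson 2 example, agreement_islands({'X': 'd1', 'Y': 'd1', 'Z': 'd2'}) returns [{'X', 'Y'}, {'Z'}]; whether {'Z'} is corrupt or merely lagging is exactly what the alternate scenarios on the next slide distinguish.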
52. What can you infer from replication failure? (Reprise.) To conclude from "replicas X, Y disagree with Z on collection A" that collection A is bad on host Z, you must assume: disagreement implies that the content of collection A differs across hosts; the contents of collection A should be identical on all hosts; if some content of collection A is bad, the entire collection is bad. Possible alternate scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; ??? (scenarios not yet anticipated).
53. What else could be wrong? Round 1 hypothesis: the disagreement is real, but doesn't matter in the long run. 1.1 Temporary differences: collections temporarily out of sync (either missing objects or different object versions) will resolve over time (e.g., if harvest frequency