US GPO AIP Independence Test
CS 496A – Senior Design
Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong
Faculty advisor: Dr. Russ Abbott
GPO contact: Kate Zwaard
Overview
Background: US GPO, FDsys, project objectives, a note on deliverables
Hardware interface
File formats (AIP): METS, MODS, and PREMIS
XML parsing
Solution strategy
Repositories
Testing
Conclusion
US GPO
The United States Government Printing Office (GPO) is in charge of producing and archiving documents for every branch of the federal government.
“The U.S Government Printing Office (GPO) provides publishing & dissemination services for the official & authentic government publications to Congress, Federal agencies, Federal depository libraries, & the American public.” (http://www.gpo.gov/about/)
FDsys
GPO is developing the Federal Digital System (FDsys), a new content management system (CMS) designed to manage all of its digital data.
“The U.S. Government Printing Office’s (GPO) Future Digital System (FDsys) will ingest, authenticate, preserve and provide access to digital content from all three branches of the U.S. Government. FDsys, which is in public beta testing, is intended to preserve digital content free from dependence on specific hardware or software.” (project description)
Project Objectives
“The objective of this project is to test whether the AIPs in FDsys are truly independent of the surrounding content management system. The CSULA team aims to either confirm or reject the claim that, with help from resources commonly available to the digital curation community, an interested party could fully reconstruct the archive using only the content data.”
“GPO will supply a set of content data from its archival storage. This data will include content files, metadata files (in XML according to the standards referenced above), and METS binding files (in XML) that describe how all of the objects are related. The CSULA team will inspect the information and, using the METS standard, determine whether the information in XML is sufficient for a user to make sense of the data and ingest it to another repository. Because the data is stored in arbitrary folders, scripts would have to be written to assemble the content packages from the locations specified in the METS file.”
This project simulates FDsys breaking down due to some catastrophic attack or error.
We are attempting to categorize and reconstruct a sample of data from FDsys outside the context of the actual CMS.
The only references we have available, other than the actual files in the archive, are publicly defined standards.
It is our hope that this project will help GPO improve the robustness of its file system.
A Note on Deliverables
This is not a typical computer science design project because our aim is not to design software. Instead, we will be conducting scripted tests on real data and forming conclusions based on the results.
Deliverables will most likely include:
a written report of our findings and recommendations
a reorganized version of the input data
Hardware Interface
Nothing complicated here.
A desktop PC or notebook with a USB/FireWire connection.
An external USB/FireWire hard drive containing AIPs exported from FDsys.
AIP
Archival Information Package
Defines how a digital object and its associated metadata are packaged using XML-based files:
METS (binding file)
MODS
PREMIS
METS
In order to reconstruct the archive, we will need to understand the METS files. METS is a schema that provides a flexible mechanism for encoding descriptive, administrative, and structural metadata for a digital library object. The schema is expressed in XML. A METS document consists of seven major sections, but we only need to understand the first five to recreate the archive. The last two sections, structural links and behavior, are not used by FDsys.
METS Schema
1) METS Header
The METS Header contains metadata describing the METS document itself, including such information as creator, editor, etc.
2) Descriptive Metadata
The descriptive metadata section may point to descriptive metadata external to the METS document, contain internally embedded descriptive metadata, or both. This section links to the MODS XML file.
3) Administrative Metadata
The administrative metadata section provides information regarding how the files were created and stored, intellectual property rights, metadata regarding the original source object from which the digital library object derives, and information regarding the provenance of the files comprising the digital library object. This section links to the PREMIS XML file.
4) File Section
The file section lists all files containing content which comprise the electronic versions of the digital object. <file> elements may be grouped within <fileGrp> elements, to provide for subdividing the files by object version.
5) Structural Map
The structural map is the heart of a METS document. It outlines a hierarchical structure for the digital library object, and links the elements of that structure to content files and metadata that pertain to each element.
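To make the link between the file section and the structural map concrete, here is a minimal Java sketch. The element and attribute names (file, FLocat, div, fptr, FILEID, xlink:href) come from the METS schema; the sample document, the class name, and the file paths are hypothetical and are not taken from the GPO data.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetsStructMapSketch {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Hypothetical, heavily simplified METS document used only for illustration.
    public static final String SAMPLE =
        "<mets xmlns='http://www.loc.gov/METS/' xmlns:xlink='http://www.w3.org/1999/xlink'>"
        + "<fileSec><fileGrp>"
        + "<file ID='F1'><FLocat LOCTYPE='URL' xlink:href='content/report.pdf'/></file>"
        + "<file ID='F2'><FLocat LOCTYPE='URL' xlink:href='metadata/mods.xml'/></file>"
        + "</fileGrp></fileSec>"
        + "<structMap><div LABEL='package'>"
        + "<div LABEL='content'><fptr FILEID='F1'/></div>"
        + "<div LABEL='metadata'><fptr FILEID='F2'/></div>"
        + "</div></structMap></mets>";

    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // METS elements live in a namespace
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** File section: maps each file ID to the location in its first FLocat child. */
    public static Map<String, String> fileLocations(Document doc) {
        Map<String, String> locations = new LinkedHashMap<>();
        NodeList files = doc.getElementsByTagNameNS(METS_NS, "file");
        for (int i = 0; i < files.getLength(); i++) {
            Element file = (Element) files.item(i);
            NodeList locs = file.getElementsByTagNameNS(METS_NS, "FLocat");
            if (locs.getLength() > 0) {
                Element loc = (Element) locs.item(0);
                locations.put(file.getAttribute("ID"), loc.getAttributeNS(XLINK_NS, "href"));
            }
        }
        return locations;
    }

    /** Structural map: maps each div LABEL to the locations its own fptr children point to. */
    public static Map<String, List<String>> divContents(Document doc) {
        Map<String, String> locations = fileLocations(doc);
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList divs = doc.getElementsByTagNameNS(METS_NS, "div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            List<String> hrefs = new ArrayList<>();
            // Look at direct children only, so nested divs keep their own files.
            for (Node child = div.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element
                        && METS_NS.equals(child.getNamespaceURI())
                        && "fptr".equals(child.getLocalName())) {
                    hrefs.add(locations.get(((Element) child).getAttribute("FILEID")));
                }
            }
            result.put(div.getAttribute("LABEL"), hrefs);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(divContents(parse(SAMPLE)));
    }
}
```

A div with no fptr children of its own, such as the outer "package" div here, maps to an empty list. Real METS files may also attach metadata to a div through attributes such as DMDID and ADMID, which this sketch ignores.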
MODS
A MODS file will be used to encode descriptive metadata.
MODS can be used as an extension schema to METS.
MODS consists of top-level elements that are mandatory, recommended, or optional.
PREMIS
A PREMIS file will be used to encode preservation metadata.
Preservation metadata consists of the following:
Provenance
Authenticity
Preservation activity
Technical environment
Rights management
The PREMIS data model includes the following entities:
Intellectual Entity
Object Entity
Event Entity
Agent Entity
Rights Entity*
Object, Event, and Agent Entities are described using mandatory and optional elements.
XML Parsing
As described above, all metadata is in the form of XML files. Hence, using code to read XML files is integral to the project.
We plan to use the Java programming language for our scripting needs.
Java API for XML Processing (JAXP): the standard Java library for handling XML.
It provides several different possible representations for XML.
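As an illustration of two of those representations, the sketch below reads the same small document once with DOM (the whole document as an in-memory tree) and once with SAX (a stream of events). The sample XML and the class and method names are hypothetical; only the JAXP calls themselves are standard.

```java
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class JaxpSketch {
    // Hypothetical sample document, not taken from the GPO data.
    public static final String SAMPLE =
        "<mets xmlns='http://www.loc.gov/METS/'><fileSec><fileGrp>"
        + "<file ID='F1'/><file ID='F2'/>"
        + "</fileGrp></fileSec></mets>";

    /** DOM: load the whole document into memory as a tree, then query the tree. */
    public static int countWithDom(InputStream in, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(in);
        // "*" matches elements in any namespace.
        return doc.getElementsByTagNameNS("*", localName).getLength();
    }

    /** SAX: stream through the document and react to each start tag as it is read. */
    public static int countWithSax(InputStream in, String localName) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        int[] count = {0}; // array so the anonymous handler can update it
        factory.newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (localName.equals(local)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = SAMPLE.getBytes(StandardCharsets.UTF_8);
        System.out.println("DOM: " + countWithDom(new ByteArrayInputStream(xml), "file"));
        System.out.println("SAX: " + countWithSax(new ByteArrayInputStream(xml), "file"));
    }
}
```

DOM is convenient when a script needs to jump around a document, as with METS cross-references; SAX keeps memory use low for very large files.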
Solution Strategy
Data submitted to us are AIPs, not SIPs. Repository software cannot ingest AIPs, only SIPs. We must write scripts that parse the AIPs and construct SIPs from the arbitrary file structure, then ingest those SIPs with repository software to create new AIPs.
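The assembly step could look roughly like the sketch below: given the file locations read from a METS file, copy each file out of the exported folder structure into a single package directory and report anything that cannot be found. This assumes the locations are relative paths under the export root; the class name, method name, and directory layout are hypothetical, and the actual SIP format expected by a given repository is not addressed here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PackageAssemblerSketch {
    /**
     * Copies each referenced file from the export root into packageDir.
     * Returns the references that could not be resolved to a regular file.
     * Assumption: hrefs are relative paths and file names do not collide.
     */
    public static List<String> assemble(Path archiveRoot, List<String> hrefs, Path packageDir)
            throws IOException {
        Path root = archiveRoot.normalize();
        List<String> missing = new ArrayList<>();
        Files.createDirectories(packageDir);
        for (String href : hrefs) {
            Path source = root.resolve(href).normalize();
            // Skip references that escape the export root or do not exist.
            if (!source.startsWith(root) || !Files.isRegularFile(source)) {
                missing.add(href);
                continue;
            }
            Path target = packageDir.resolve(source.getFileName());
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway export so the sketch runs without real GPO data.
        Path root = Files.createTempDirectory("aip-export");
        Files.createDirectories(root.resolve("a/b"));
        Files.write(root.resolve("a/b/report.pdf"), new byte[] {1, 2, 3});

        Path pkg = Files.createTempDirectory("sip").resolve("package-1");
        List<String> missing =
                assemble(root, Arrays.asList("a/b/report.pdf", "x/absent.xml"), pkg);

        System.out.println("copied report.pdf: " + Files.exists(pkg.resolve("report.pdf")));
        System.out.println("missing: " + missing);
    }
}
```

The list of missing references is itself a useful finding: every entry is a file the METS document names but the exported data does not contain.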
Repositories
We have also looked into third-party repository software to help parse and organize data: DSpace, Fedora Commons, EPrints.
Unfortunately, so far none of them seems ideal for the task.
Testing
After parsing and organizing the data, it will be important to perform checks to ensure that the reconstruction is accurate.
We may send a preliminary report to GPO for verification.
The exact testing procedure is still undefined, as we haven't had a chance to investigate the data in depth yet.
Our goals should be clearer once we understand exactly what type of data we are dealing with.
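One check that could be scripted regardless of the final procedure is a fixity comparison. The METS schema allows a CHECKSUM and CHECKSUMTYPE attribute on each file element, and PREMIS can record a message digest for an object; whether the GPO sample data populates either is an assumption we have not yet verified. The sketch below only shows the comparison itself, with a hypothetical class name and a throwaway test file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FixityCheckSketch {
    /** Returns the lowercase hex digest of a file, reading in chunks to handle large files. */
    public static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if the file's digest equals the value recorded in the metadata. */
    public static boolean matches(Path file, String algorithm, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return hexDigest(file, algorithm).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        // Throwaway file containing "abc", a standard SHA-256 test vector.
        Path file = Files.createTempFile("fixity", ".txt");
        Files.write(file, "abc".getBytes(StandardCharsets.US_ASCII));
        String recorded = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
        System.out.println("matches: " + matches(file, "SHA-256", recorded));
    }
}
```

The algorithm name passed in would have to be mapped from whatever the metadata records to a name that java.security.MessageDigest accepts, such as "MD5", "SHA-1", or "SHA-256".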
Conclusion
Our thanks to Kate, Dr. Abbott, and Dr. Pamula for their support.