1
Using Alfresco to create an Open Archival Information System
Dr Birgit Plietzsch, Arts Computing Advisor
&
Swithun Crowe, Developer for Arts and Humanities Computing projects
IT Services, University of St Andrews
2
Structure
1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco
3
Digital Preservation
Digital Preservation is …
• the active management of digital information over time to ensure its accessibility
• long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required for
  • Long-term is defined as "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely".
  • Retrieval means obtaining needed digital files from the long-term, error-free digital storage, without possibility of corrupting the continued error-free storage of the digital files.
  • Interpretation means that the retrieved digital files (for example texts, charts, images or sounds) are decoded and transformed into usable representations. This is often interpreted as "rendering", i.e. making the content available for a human to access. However, in many cases it will mean that the files can be processed by computational means.
(Source: Wikipedia)
4
Institutional context
• Legal requirements (e.g. Freedom of Information Act)
• Protection of institutional intellectual property
• Funding body requirements
  • until 2008: Arts and Humanities Data Service (national depository for arts and humanities research data)
  • no such body exists now for the Arts and Humanities
  • for other subjects, national support is patchy
• Moral obligations
  • protection of cultural and corporate memory
5
Records of the Parliaments of Scotland project
www.rps.ac.uk
• proceedings of the Scottish Parliament from the first surviving act of 1235 to the union of 1707
• 10 years of research
• no print publication
• c. 16.5m words
• issues:
  • inconsistent editorial practices
  • obsolescence of software originally used
  • long-term sustainability of research data
6
Digital Archiving Project (DAP)
• Pilot project
• Scope:
  • data contained in electronic resources produced within the Faculty of Arts, University of St Andrews
• Aims:
  • ensure long-term sustainability of RPS data
  • investigate the requirements of digital archiving and obtain experience
  • meet funding body requirement
  • flexible implementation (to allow for additional future uses)
7
The DAP archive
[Figure: Concepts and Properties of Archives and Hosting in the Strategy and their Relationships. © Charles Beagrie Ltd 2009. Creative Commons Attribution-Share Alike 3.0. Key: solid colour represents core properties and fading colour represents weaker properties of archives and hosting services.]
9
The DAP Open Archival Information System
• An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
• reference model: ISO 14721:2003
10
Open Archival Information System: workflows
Seven functions
• Ingest
• Archival Storage
• Data Management
• Administration
• Preservation Planning
• Access
• Management
• SIP: Submission Information Package
• AIP: Archival Information Package
• DIP: Dissemination Information Package
11
Open Archival Information System: data package
Implementation
• Content Information:
  • XML
  • TIFF
  • DOC
  • etc.
• Preservation Description Information:
  • PREMIS
• Descriptive Information:
  • MODS
• Packaging Information:
  • METS
12
Preservation strategy
• What needs to be preserved?
  • data
  • layout
  • functionality
  • user experience
• What are the significant properties?
  • generic low-level properties (e.g. basic data unit, byte-level encoding, data type, and logical schema)
  • data type specific properties (example: text)
    • underlying abstract forms (font, spacing, layout)
    • sub-properties (e.g. font type, style, family, size, colour)
• How do we preserve?
  • bit stream preservation
  • emulation
  • migration
• Adopted approach:
  • data is preserved
  • combination of bit stream preservation and file format migration upon ingest
13
Data models
• description needs of different types of material
  • electronic resources
  • digital images
  • video
  • research papers
  • University records
  • etc.
• introduce flexibility
  • future wider uses of the archive
14
Electronic resources data model
• expressed in MODS
• 3 layers
• use for pilot
• more models can be developed
[Diagram labels: Project; Resource type (Research data, Documentation, Code); Digital object; Resource Discovery Metadata]
15
Approaches investigated
Monolithic approach
• DSpace
  • issues with Archival Storage and Data Management functions
• EPrints
  • issues with Administration and Access functions
• RODA
  • technical issues
• No support for Preservation Planning
Breakdown into OAIS requirements
• Repository framework: Fedora Commons
  • issues with suitable front end for Ingest, Access, Preservation Planning, or Administration functions
  • highly customisable
• Metadata
  • MODS
  • METS
  • PREMIS
16
Implementation of DAP
Software used
• Alfresco (www.alfresco.com)
  • Share
  • Explorer
  • Records Management
• Fedora Commons (fedora-commons.org)
• Planets Suite (www.openplanetsfoundation.org)
  • Plato
  • Testbed
[Diagram labels, OAIS functions covered: Ingest; Archival Storage & Data Management; Preservation Planning; Administration; Access; Management]
17
The DAP Open Archival Information System
18
Unresolved issues
• Version control of AIPs
  • Alfresco / Fedora interaction?
• Access front end
  • Fedora Commons front ends do not normally support OAIS functions
• Can extra properties be added to folders and files in the Records Management site?
We welcome ideas that might help us resolve the above three issues.
20
Developing the OAIS Ingest in Alfresco
• FITS and PREMIS
  • Technical metadata
• RPS and MODS
  • Resource discovery metadata
• Antivirus scanning
• METS
  • Wrapping files and metadata
Introduction
21
FITS and PREMIS
• FITS (File Information Tool Set)
  • http://code.google.com/p/fits/
  • Consolidates file format metadata from 3rd party tools
    • Jhove, DROID, NLNZ ME, Exiftool and others
  • Output as XML
• PREMIS (PREservation Metadata: Implementation Strategies)
  • http://www.loc.gov/standards/premis/
  • Data dictionary of semantic units, maps to XML
• Transform FITS XML to PREMIS using XSLT
Introduction
22
FITS and PREMIS
• Text property defined in custom aspect for storing FITS XML in node metadata
• Create temporary file containing content of node
• Run FITS on temporary file
• Put output into custom property
• Later on, transform this to PREMIS XML
• Can be run as space rule
• Compile to AMP using Alfresco SDK
The action
23
FITS and PREMIS
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
<bean id="fits-action-messages" class="org.alfresco.i18n.ResourceBundleBootstrapComponent">
<property name="resourceBundles">
<list><value>alfresco.module.FitsAction.fits-action-messages</value></list>
</property>
</bean>
<bean id="fits-model-bootstrap" parent="dictionaryModelBootstrap" depends-on="dictionaryBootstrap">
<property name="models">
<list><value>alfresco/module/FitsAction/context/fitsModel.xml</value></list>
</property>
</bean>
<bean id="fits-action" class="uk.ac.st_andrews.repo.action.executer.FitsActionExecuter" parent="action-executer">
<property name="serviceRegistry"><ref bean="ServiceRegistry"/></property>
</bean>
</beans>
fits-action-context.xml
24
FITS and PREMIS
package uk.ac.st_andrews.repo.action.executer;
public class FitsActionExecuter extends ActionExecuterAbstractBase
{
public void setServiceRegistry(ServiceRegistry serviceRegistry);
protected void addParameterDefinitions(List<ParameterDefinition> paramList);
protected void executeImpl(Action action, NodeRef actionedUponNodeRef);
}
FitsActionExecuter
25
FITS and PREMIS
// make sure node exists
if (!nodeService.exists(actionedUponNodeRef))
{
    throw new Exception("no node");
}

// make sure that node has fits aspect
QName fitsAspect = QName.createQName(fitsURI, "fitsAspect");
if (!nodeService.hasAspect(actionedUponNodeRef, fitsAspect))
{
    this.nodeService.addAspect(actionedUponNodeRef, fitsAspect, null);
}

// create new FITS instance
Fits fits = new Fits();
Fits.allowRounding = true;
FitsOutput result = null;
FitsActionExecuter.executeImpl (fragment)
26
FITS and PREMIS
// put input into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File inputFile =
    TempFileProvider.createTempFile("FitsActionExecuter_", "." + fileName);
reader.getContent(inputFile);

// run FITS on the temp file to produce technical metadata
result = fits.examine(inputFile);
Document doc = result.getFitsXml();

// serialise the FITS XML document to a string
XMLOutputter serializer = new XMLOutputter(Format.getPrettyFormat());
String output = serializer.outputString(doc);

// store the FITS XML in the custom property
QName fitsProp = QName.createQName(fitsURI, "fitsOutput");
nodeService.setProperty(actionedUponNodeRef, fitsProp, output);
FitsActionExecuter.executeImpl (fragment cont.)
27
FITS and PREMIS
<identification status="CONFLICT">
<identity format="Microsoft Word" mimetype="application/msword">
<tool toolname="Exiftool" toolversion="8.25" />
<tool toolname="file utility" toolversion="5.04" />
<tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
<tool toolname="ffident" toolversion="0.2" />
</identity>
<identity format="OLE2 Compound Document Format" mimetype="application/octet-stream">
<tool toolname="Droid" toolversion="3.0" />
<externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/111</externalIdentifier>
</identity>
</identification>
Fragment of FITS XML showing conflicting file formats
28
FITS and PREMIS
<premis:format>
  <premis:formatDesignation>
    <premis:formatName>Microsoft Word</premis:formatName>
  </premis:formatDesignation>
</premis:format>
<premis:format>
  <premis:formatDesignation>
    <premis:formatName>OLE2 Compound Document Format</premis:formatName>
  </premis:formatDesignation>
  <premis:formatRegistry>
    <premis:formatRegistryName>Droid (3.0)</premis:formatRegistryName>
    <premis:formatRegistryKey>fmt/111</premis:formatRegistryKey>
    <premis:formatRegistryRole>puid</premis:formatRegistryRole>
  </premis:formatRegistry>
</premis:format>
Corresponding fragment of PREMIS XML
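The mapping from the FITS fragment to the PREMIS fragment can be sketched as an XSLT transform run from Java. This is only an illustration, not the project's stylesheet: the stylesheet text, the class name FitsToPremisSketch, the wrapping premis:objectCharacteristics element and the PREMIS namespace URI (info:lc/xmlns/premis-v2) are assumptions, and the input is the namespace-free fragment shown on the slide rather than full FITS output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FitsToPremisSketch {
    // Hypothetical stylesheet: emits one premis:format per FITS identity,
    // plus a formatRegistry entry for each external identifier.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:premis='info:lc/xmlns/premis-v2'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/identification'>"
      + "<premis:objectCharacteristics>"
      + "<xsl:for-each select='identity'>"
      + "<premis:format><premis:formatDesignation>"
      + "<premis:formatName><xsl:value-of select='@format'/></premis:formatName>"
      + "</premis:formatDesignation>"
      + "<xsl:for-each select='externalIdentifier'>"
      + "<premis:formatRegistry>"
      + "<premis:formatRegistryName>"
      + "<xsl:value-of select=\"concat(@toolname, ' (', @toolversion, ')')\"/>"
      + "</premis:formatRegistryName>"
      + "<premis:formatRegistryKey><xsl:value-of select='.'/></premis:formatRegistryKey>"
      + "<premis:formatRegistryRole><xsl:value-of select='@type'/></premis:formatRegistryRole>"
      + "</premis:formatRegistry>"
      + "</xsl:for-each>"
      + "</premis:format>"
      + "</xsl:for-each>"
      + "</premis:objectCharacteristics>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Cut-down version of the FITS fragment from the slide (tool elements omitted).
    static final String SAMPLE =
        "<identification status='CONFLICT'>"
      + "<identity format='Microsoft Word' mimetype='application/msword'/>"
      + "<identity format='OLE2 Compound Document Format' mimetype='application/octet-stream'>"
      + "<externalIdentifier toolname='Droid' toolversion='3.0' type='puid'>fmt/111</externalIdentifier>"
      + "</identity>"
      + "</identification>";

    // Applies the stylesheet to a FITS XML string and returns the PREMIS XML.
    static String transform(String fitsXml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(fitsXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE));
    }
}
```

Both conflicting identities are carried over into PREMIS here, as in the slide fragments; resolving the conflict would be a separate step.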
29
RPS and MODS
• Records of the Parliaments of Scotland marked up in thousands of XML documents
  • http://www.rps.ac.uk
• Using Text Encoding Initiative (TEI)
  • http://www.tei-c.org/index.xml
• TEI headers contain resource discovery metadata
• Extract metadata from documents and populate custom metadata fields
• Can be run as space rule
• Compile as AMP using Alfresco SDK
Introduction
30
RPS and MODS
<TEI.2 id="_william_and_mary_t1689_3_6_d6_trans" n="william_and_mary_trans">
<teiHeader>
<fileDesc>
<titleStmt>
<title>A committee appointed for controverted elections</title>
</titleStmt>
<editionStmt>
<edition n="session">william_and_mary_t1689_3_1_d2_trans</edition>
</editionStmt>
<publicationStmt>
<date>16890314</date>
</publicationStmt>
</fileDesc>
</teiHeader>
<text>...</text>
</TEI.2>
TEI example, with annotations:
• id attribute of TEI.2: unique ID for document
• n attribute of TEI.2: document belongs to translated version of records from reign of William and Mary
• title: main heading in document
• edition n="session": pointer to session that document belongs to
• date: date of document, in YYYYMMDD format
31
RPS and MODS
package uk.ac.st_andrews.repo.content.metadata;
public class RPSMetadataExtracter extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
{
public RPSMetadataExtracter();
protected Map<String, Serializable> extractRaw(ContentReader reader);
}
RPSMetadataExtracter
32
RPS and MODS
// set up parser
SAXParser sp = spf.newSAXParser();
InputStream cis = reader.getContentInputStream();
InputSource is = new InputSource(cis);
RPSSaxParser teip = new RPSSaxParser();

// do parsing
teip.setProperties(map);
sp.parse(is, teip);
map = teip.getProperties();

// loop over properties found
Set s = map.entrySet();
Iterator it = s.iterator();
while (it.hasNext())
{
    Map.Entry m = (Map.Entry) it.next();
    putRawValue((String) m.getKey(), (String) m.getValue(), rawProperties);
}
RPSMetadataExtracter.extractRaw
33
RPS and MODS
package uk.ac.st_andrews.repo.content.metadata;
public class RPSSaxParser extends org.xml.sax.helpers.DefaultHandler
{
public void setProperties(Map<String, Serializable> prop);
public Map<String, Serializable> getProperties();
public void startElement(String uri, String localName, String qName, Attributes attributes);
public void endElement(String uri, String localName, String qName);
public void characters(char[] ch, int start, int length);
private void handleID(String id);
private void handleDate(String d);
}
RPSSaxParser
34
RPS and MODS
// property names
private static final String KEY_ID = "rpsID";
private static final String KEY_REIGN = "rpsReign";
private static final String KEY_VERSION = "rpsVersion";
private static final String KEY_HEADING = "rpsHeading";
private static final String KEY_SESSION = "rpsSession";
private static final String KEY_DATE = "rpsDate";
private static final String KEY_TITLE = "cmTitle";

// some properties get set in RPSSaxParser.characters
if (true == inTitle)
{
    rawProperties.put(KEY_TITLE, new String(ch, start, length));
}
else if (true == inSession)
{
    rawProperties.put(KEY_SESSION, new String(ch, start, length));
}
RPSSaxParser
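The extraction of header fields with SAX can be sketched as a standalone handler. This is a simplified illustration rather than the RPSSaxParser source: the class name TeiHeaderSketch is hypothetical, only four of the property keys are filled, and storing the raw id and date strings unprocessed is an assumption. One deliberate difference from the fragment above is that text is buffered and stored in endElement, because a SAX parser is allowed to deliver a single text node across several characters() calls.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TeiHeaderSketch extends DefaultHandler {
    private final Map<String, String> props = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHeader = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this element
        if ("TEI.2".equals(qName)) {
            props.put("rpsID", atts.getValue("id"));
        } else if ("teiHeader".equals(qName)) {
            inHeader = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // may be called more than once per text node, so append
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (inHeader && "title".equals(qName)) {
            props.put("cmTitle", value);
        } else if (inHeader && "edition".equals(qName)) {
            props.put("rpsSession", value);
        } else if (inHeader && "date".equals(qName)) {
            props.put("rpsDate", value);
        } else if ("teiHeader".equals(qName)) {
            inHeader = false;
        }
        text.setLength(0);
    }

    // Parses a TEI document held in a string and returns the properties found.
    static Map<String, String> extract(String xml) throws Exception {
        TeiHeaderSketch handler = new TeiHeaderSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.props;
    }

    public static void main(String[] args) throws Exception {
        String tei = "<TEI.2 id='_william_and_mary_t1689_3_6_d6_trans' n='william_and_mary_trans'>"
            + "<teiHeader><fileDesc>"
            + "<titleStmt><title>A committee appointed for controverted elections</title></titleStmt>"
            + "<editionStmt><edition n='session'>william_and_mary_t1689_3_1_d2_trans</edition></editionStmt>"
            + "<publicationStmt><date>16890314</date></publicationStmt>"
            + "</fileDesc></teiHeader><text>...</text></TEI.2>";
        System.out.println(extract(tei));
    }
}
```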
35
RPS and MODS
# Namespaces
namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
# Mapping of property names to Qualified names used in model
rpsID=rps:id
rpsReign=rps:reign
rpsSession=rps:session
rpsDate=rps:date
rpsVersion=rps:version
rpsHeading=rps:heading
cmTitle=cm:title
RPSMetadataExtracter.properties
36
RPS and MODS
<aspect name="rps:metadata">
<title>RPS Metadata</title>
<properties>
<property name="rps:id"><type>d:text</type></property>
<property name="rps:reign"><type>d:text</type></property>
<property name="rps:session"><type>d:text</type></property>
<property name="rps:date"><type>d:text</type></property>
<property name="rps:heading"><type>d:text</type></property>
<property name="rps:version"><type>d:text</type></property>
</properties>
</aspect>
rpsModel.xml (fragment showing aspect)
37
RPS and MODS
# I18N strings
rpsID=RPS ID
rpsReign=RPS Reign
rpsSession=RPS Session
rpsDate=RPS Date
rpsVersion=RPS Version
rpsHeading=RPS Heading
webclient.properties
38
RPS and MODS
• Metadata Object Description Schema
  • http://www.loc.gov/standards/mods/
• MODS is a resource discovery metadata standard
• Working on defining MODS data models
  • For Project, Resource Type and Digital Object levels
• Will move RPS metadata into MODS fields
Using MODS
39
Antivirus Action
• Creates an action for scanning files for viruses
• Uses ClamAV
  • http://www.clamav.net/lang/en/
• Can be configured for other tools
• Emails creator of file if virus found
• Deletes file from repository if virus found
• Can be run as space rule
• Compile as AMP using Alfresco SDK
Introduction
40
Antivirus Action
antivirus-action.xml (fragment)
<bean id="antivirus-action" class="uk.ac.st_andrews.repo.action.executer.AntivirusActionExecuter" parent="action-executer">
<!-- services needed by bean -->
<property name="contentService"><ref bean="contentService" /></property>
<property name="nodeService"><ref bean="nodeService" /></property>
<property name="templateService"><ref bean="templateService" /></property>
<property name="actionService"><ref bean="actionService" /></property>
<property name="personService"><ref bean="personService" /></property>
<!-- person that email will come from, defined in alfresco-global.properties -->
<property name="fromEmail">
<value>${antivirus.mailer}</value>
</property>
<!-- path to Freemarker template, defined in alfresco-global.properties -->
<property name="emailTemplate">
<value>${antivirus.template}</value>
</property>
41
Antivirus Action
antivirus-action.xml (fragment, cont.)
<property name="command">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandMap">
<map>
<!-- command to run; ${antivirus.exe} is set in alfresco-global.properties, ${source} in the Java class -->
<entry key=".*" value="${antivirus.exe} ${source}"/>
</map>
</property>
<property name="errorCodes">
<value>1</value><!-- exit code 1 indicates that virus was found -->
</property>
</bean>
</property>
</bean>
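How the scanner's exit code drives the action can be sketched outside Alfresco with plain Java. This is an illustration only: the class ScanResultSketch and its Outcome enum are hypothetical, and it uses ProcessBuilder instead of the RuntimeExec bean configured above. The code assumes ClamAV's clamscan convention, where 0 means no virus was found, 1 means a virus was found, and any other value signals an error.

```java
import java.util.List;

class ScanResultSketch {
    enum Outcome { CLEAN, INFECTED, ERROR }

    // Maps a scanner exit code to an outcome (clamscan convention assumed).
    static Outcome classify(int exitCode) {
        switch (exitCode) {
            case 0:
                return Outcome.CLEAN;
            case 1:
                return Outcome.INFECTED;
            default:
                return Outcome.ERROR;
        }
    }

    // Runs the given command line and classifies its exit code.
    // inheritIO() passes scanner output through, so the process cannot
    // block on a full output buffer.
    static Outcome scan(List<String> command) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return classify(process.waitFor());
    }

    public static void main(String[] args) {
        System.out.println(classify(1)); // exit code 1: virus found
    }
}
```

A caller would pass the scanner executable and the temporary file path as the command list, then email the creator and delete the node only for the INFECTED outcome, treating ERROR separately from a positive detection.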
42
Antivirus Action
AntivirusActionExecuter
package uk.ac.st_andrews.repo.action.executer;
public class AntivirusActionExecuter extends ActionExecuterAbstractBase
{
public void setContentService(ContentService contentService);
public void setNodeService(NodeService nodeService);
public void setTemplateService(TemplateService templateService);
public void setActionService(ActionService actionService);
public void setPersonService(PersonService personService);
public void setFromEmail(String fromEmail);
public void setCommand(RuntimeExec command);
public void setEmailTemplate(String emailTemplate);
public void init();
protected void addParameterDefinitions(List<ParameterDefinition> paramList);
protected void executeImpl(final Action ruleAction, final NodeRef actionedUponNodeRef);
}
43
Antivirus Action
AntivirusActionExecuter.executeImpl (fragment)
// put content into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File sourceFile =
    TempFileProvider.createTempFile("anti_virus_check_", "_" + fileName);
reader.getContent(sourceFile);

// set source property for command
Map<String, String> properties = new HashMap<String, String>(1);
properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());

// execute the antivirus command
ExecutionResult result = null;
try
{
    result = command.execute(properties);
}
catch (Throwable e)
{
    throw new AlfrescoRuntimeException("Antivirus check error: \n" + command, e);
}
44
Antivirus Action
AntivirusActionExecuter.executeImpl (fragment, cont.)
// try to get document creator's details
String creatorName = (String) nodeService.getProperty(actionedUponNodeRef,
    ContentModel.PROP_CREATOR);
if (null == creatorName || 0 == creatorName.length())
{
    throw new Exception("couldn't get creator's name");
}

NodeRef creator = personService.getPerson(creatorName);
if (null == creator)
{
    throw new Exception("couldn't get creator");
}

String creatorEmail = (String) nodeService.getProperty(creator,
    ContentModel.PROP_EMAIL);
if (null == creatorEmail || 0 == creatorEmail.length())
{
    throw new Exception("couldn't get creator's email address");
}
45
Antivirus Action
AntivirusActionExecuter.executeImpl (fragment, cont.)
// put together message
Map<String, Object> model = new HashMap<String, Object>(8, 1.0f);
model.put("filename", fileName);
model.put("message", result);

String emailMsg = templateService.processTemplate("freemarker", emailTemplate, model);

// send email message
Action emailAction = actionService.createAction("mail");
emailAction.setParameterValue(MailActionExecuter.PARAM_TO, creatorEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_FROM, fromEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_SUBJECT,
    "Virus found in " + fileName);
emailAction.setParameterValue(MailActionExecuter.PARAM_TEXT, emailMsg);
emailAction.setExecuteAsynchronously(true);
actionService.executeAction(emailAction, null);

// delete node
nodeService.addAspect(actionedUponNodeRef, ContentModel.ASPECT_TEMPORARY, null);
nodeService.deleteNode(actionedUponNodeRef);
46
METS and Fedora Commons
• Metadata and Encoding Transmission Standard (METS)
  • http://www.loc.gov/standards/mets/
• METS is a wrapper for other metadata documents
• Plan to generate METS documents containing/referencing:
  • Ingested files
  • Renderings of these files (thumbnails, reference copies, archival formatted versions etc.)
  • Resource discovery metadata
  • Technical metadata
• Fedora Commons can ingest METS documents as SIPs
  • http://fedora-commons.org/
Introduction
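Since the METS generation is described only as a plan, the following is a speculative sketch of what a minimal wrapper might look like, built with the standard Java DOM API. The class MetsWrapperSketch, the IDs (DMD1, TECH1, FILE1) and the USE value are hypothetical; the xmlData elements are left empty where the MODS and PREMIS records would go, and the result is generic METS, not necessarily the profile that Fedora Commons expects on ingest.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class MetsWrapperSketch {
    static final String METS = "http://www.loc.gov/METS/";
    static final String XLINK = "http://www.w3.org/1999/xlink";

    // Creates a METS-namespaced child element and appends it to the parent.
    static Element child(Element parent, String name) {
        Element e = parent.getOwnerDocument().createElementNS(METS, "mets:" + name);
        parent.appendChild(e);
        return e;
    }

    // Builds a METS document referencing one ingested file.
    static String wrap(String fileUrl, String mimeType) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();
        Element mets = doc.createElementNS(METS, "mets:mets");
        doc.appendChild(mets);

        // resource discovery metadata: a MODS record would sit inside xmlData
        Element dmd = child(mets, "dmdSec");
        dmd.setAttribute("ID", "DMD1");
        Element dmdWrap = child(dmd, "mdWrap");
        dmdWrap.setAttribute("MDTYPE", "MODS");
        child(dmdWrap, "xmlData");

        // technical metadata: a PREMIS record would sit inside xmlData
        Element tech = child(child(mets, "amdSec"), "techMD");
        tech.setAttribute("ID", "TECH1");
        Element techWrap = child(tech, "mdWrap");
        techWrap.setAttribute("MDTYPE", "PREMIS");
        child(techWrap, "xmlData");

        // the ingested file, referenced by URL
        Element group = child(child(mets, "fileSec"), "fileGrp");
        group.setAttribute("USE", "original");
        Element file = child(group, "file");
        file.setAttribute("ID", "FILE1");
        file.setAttribute("MIMETYPE", mimeType);
        file.setAttribute("ADMID", "TECH1");
        Element locat = child(file, "FLocat");
        locat.setAttribute("LOCTYPE", "URL");
        locat.setAttributeNS(XLINK, "xlink:href", fileUrl);

        // structural map tying the file to its descriptive metadata
        Element div = child(child(mets, "structMap"), "div");
        div.setAttribute("DMDID", "DMD1");
        child(div, "fptr").setAttribute("FILEID", "FILE1");

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wrap("doc.xml", "text/xml"));
    }
}
```

Renderings such as thumbnails or reference copies could be added as further fileGrp elements with different USE values.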
47
Find out more
Project source code available on Alfresco Forge
• FITS in Alfresco
  • http://forge.alfresco.com/projects/fitsinalfresco/
• RPS Metadata Extracter
  • http://forge.alfresco.com/projects/rpsmetadata/
• Antivirus
  • http://forge.alfresco.com/projects/antivirus/
University of St Andrews Digital Archiving Project
• http://www.st-andrews.ac.uk/itsupport/academic/arts