Building the Digital Repository of Ireland Infrastructure

www.dri.ie

Contributors:

Dr. Kathryn Cassidy, Software Engineer, Trinity College Dublin

Dr. Sandra Collins, Director Digital Repository of Ireland, Royal Irish Academy

Fabrizio Valerio Covone, Software Engineer, Dublin Institute of Technology

Dermot Frost, Manager, Trinity College Dublin

Damien Gallagher, Senior Software Engineer, Maynooth University

Dr. Stuart Kenny, Software Engineer, Trinity College Dublin

Eoin Kilfeather, Research Team Leader, Dublin Institute of Technology

Dr. Agustina Martínez García, Postdoctoral Researcher, Maynooth University

Charlene McGoohan, Requirements Manager, Maynooth University

Jenny O’Neill, DRI Data Curator, Trinity College Dublin

Sinéad Redmond, Software Engineer, Maynooth University

Jimmy Tang, Senior Software Engineer, Trinity College Dublin

Peter Tiernan, Systems and Storage Engineer, Trinity College Dublin

Dr. Sharon Webb, Knowledge Transfer Manager, Royal Irish Academy (former Requirements Analyst)

Edited by:

Rebecca Grant, Digital Archivist, Royal Irish Academy

Jenny O’Neill, DRI Data Curator, Trinity College Dublin

Dr. Sharon Webb, Knowledge Transfer Manager, Royal Irish Academy (former Requirements Analyst)

DOI: http://dx.doi.org/10.3318/DRI.2015.5

Contents

Director’s Foreword

The Digital Repository of Ireland

Executive summary

1. Introduction

2. Requirements

3. Architecture of the System

4. Core Technology Choices

5. Storage

6. Metadata and Data Modelling

7. User Interface Design

8. Conclusion

Bibliography

Appendix 1

Appendix 2


Director’s Foreword

The Digital Repository of Ireland is a national digital preservation repository for humanities and social sciences data. We started building the DRI e-infrastructure in 2011 at

the inception of our project, and from the very beginning it has been a key priority for

DRI to inform and be informed by international best practice. When we conducted

research into what other national or large-scale digital repositories had chosen for their

architecture, technologies and methodologies, we found that many organisations did

not document and publish their technology choices. It is our ethos in DRI to openly

share best practice with the wider community in the hope that we may all benefit from making informed choices based on commonly adopted standards and exemplars of good practice. Hence this report, which documents and explains the design methodology, the technology choices, the standards and reference models that we have

implemented in the DRI.

A core principle of DRI is Open Access, and our choices of software and platforms have

reflected this core principle; it is important to us, as a publicly funded endeavour, to

implement community supported software, and to contribute to the community, and

that is why we work with Hydra, Fedora, Solr and other open source code bases.

We don’t present our infrastructure as the perfect solution or the only solution, but it is our working solution that we expect to demonstrate robustness, scalability and performance for our users, data depositors and partners. If you are designing a repository solution for your institution, your discipline or indeed your country, we recommend carrying out a detailed requirements analysis and, on the basis of that analysis, making informed technology choices that meet your needs, whilst giving consideration to the lessons and implementations of the community and to how you too might contribute to global knowledge in this area.

I must thank our amazing infrastructure team in the DRI, past and present, who have worked tirelessly over the last three years to deliver the DRI repository. They have worked within a highly interdisciplinary team, and demonstrated great understanding and perseverance with the often complex and challenging requirements of a broad humanities and social sciences research and information management community.


We publish this report knowing that our work is not completed: a digital repository is

a living, growing organism, that will face new challenges as the huge quantities of data

grow even faster, as the complexity of the data increases, and as the requirements for

sophisticated data visualisation, open data and research data management grow. This

report is a snapshot in time of our development to date and I look forward to the next

iterations that will reflect the enhanced technologies and approaches that will no doubt

develop over the next years.

Dr. Sandra Collins

Director, Digital Repository of Ireland

The Digital Repository of Ireland

The Digital Repository of Ireland is an interactive, trusted digital repository for social and

cultural content held by Irish institutions. By providing a central Internet access point

and interactive multimedia tools, DRI facilitates engagement with contemporary and

historical data, allowing the public, students and scholars to research Ireland’s cultural

heritage and social life. As a national digital infrastructure, DRI is working with a wide

range of institutional stakeholders to link together and preserve Ireland’s rich and varied

humanities and social science data.

The DRI was established in 2011, when it received funding of €5.2M over four years from the Irish Government’s PRTLI Cycle 5. The DRI consortium comprises the following

partners: The Royal Irish Academy (Lead Institution), Maynooth University (MU), Trinity

College Dublin (TCD), Dublin Institute of Technology (DIT), National University of Ireland

Galway (NUIG), and National College of Art and Design (NCAD). The DRI is currently

collaborating with a network of cultural, social, academic and industry partners, including the National Library of Ireland (NLI) and the Irish National Broadcaster RTÉ.


Executive summary

The Digital Repository of Ireland (DRI) is the national trusted digital repository for Ireland’s

social and cultural data. The repository links together and preserves both historical and

contemporary data held by Irish institutions, providing a central internet access point

and interactive multimedia tools. The successful implementation of DRI’s fundamental

goal, to build an interactive trusted digital repository (TDR), depends upon the system’s

ability to implement and satisfy identified user needs.

The architecture of the DRI infrastructure was designed to meet all of the requirements

of the project and is broken into several different components, including a storage

cluster, physical servers that act as virtual machine (VM) hosts and a virtualised environment to host the

various services required to deliver the repository functionality.

In order to implement the DRI infrastructure a number of technology choices were made.

A framework that occupies a middle-ground between a full repository solution and the

basic components of a system was utilised. The Hydra framework was selected with

Fedora being used for storing and managing ingested digital objects together with Solr

for search and discovery and Blacklight providing the interfaces for user browsing, ingest

and management functions.

In order to develop a Trusted Digital Repository it was necessary to build a robust and

scalable storage solution. DRI decided to focus on software defined storage (SDS) solutions because of the cost, flexibility and sustainability they can provide, with Ceph being

chosen for its selection of interfaces and APIs for many programming languages, as well

as a rich array of libraries.

The underlying data management layer must be capable of dealing with a wide variety

of digital collections. DRI encourages the use of best practice and commonly used data

and metadata formats and standards which are stored in the Repository as collections

of Fedora digital objects. To facilitate the creation, management and storage of digital

objects¹, the Repository data models were implemented as a Ruby gem which makes

use of ActiveFedora. ActiveFedora also facilitates the specification of relationships

between digital objects.

As well as managing and preserving digital objects the Repository provides access to

these objects. The Repository’s user interfaces are realised using Ruby on Rails views and the JavaScript library jQuery. The complex adaptive layout of the Repository makes use of the Sass stylesheet language and a grid framework called Zen Grids. Interactive multimedia

tools include timeline and mapping visualisations.

1 A digital object refers to a file or digital asset and its associated metadata, e.g. an image, a document, an audio file and all its accompanying files, including XML encodings.


1. Introduction

A trusted digital repository (TDR) is “one whose mission is to provide reliable, long-term

access to managed digital resources to its designated community, now and in the future”

(Research Libraries Group and OCLC, 2002, p.5). The Digital Repository of Ireland (DRI)

is committed to linking together and preserving both historical and contemporary data

held by Irish institutions and providing a central internet access point and interactive

multimedia tools for use by the public, students and scholars.

This report outlines the approach taken by DRI in addressing the technological challenges

of building a TDR and it is hoped it will be of assistance to others who are developing

digital repositories. It is aimed at a technical audience who are interested in the technical

decisions made by the development team, but it may also be of interest to information

professionals hoping to roll out similar programs in their organisation.

DRI was funded by the Higher Education Authority through its Programme for Research

in Third Level Institutions (PRTLI)², Cycle 5 (2011-2015). The project is federated across six academic institutions: the Royal Irish Academy (RIA), Maynooth University (MU), Trinity College Dublin (TCD), Dublin Institute of Technology (DIT), National University of Ireland Galway (NUIG), and National College of Art and Design (NCAD). Since 2011 DRI has been awarded additional funding through the Department of Arts, Heritage and the Gaeltacht, Enterprise Ireland, the Irish Research Council and Science Foundation Ireland, among

others. A report on long-term sustainability of digital repositories has also been published

by DRI and discussions are on-going with various agencies to address DRI’s sustainability

(Kitchin, Collins, & Frost, 2015).

The consortium brings together researchers in computer science and software development, requirements engineering, UI design and experience, digital archiving, stakeholder engagement and community outreach, policy development and project management, and academics in the Humanities and Social Sciences (HSS). The project team work collaboratively to deliver a robust, digital platform that supports long-term digital

preservation for deposited digital objects and provides sustainable access. The team also

develops policies which directly inform the ongoing development and management of

the Repository, which include but are not limited to areas such as copyright and licensing,

data protection and digital preservation strategy (data citation, persistent identifiers,

media formats and storage). The development of guidelines and training workshops, to

engage with the DRI’s designated community, is also an integral part of the team’s work

and an important mechanism to drive national best practice in the area of long-term

digital preservation and data management.

2 http://www.hea.ie/en/funding/research-funding/programme-for-research-in-third-level-institutions, last accessed 6 June 2015.

1.1. Project Structure

The work program of the DRI is organised into multiple Strands and Work Packages

(WP).³ These are:

Strand 1 Management – Long Term Planning (WP1) and Project Management (WP2)

Strand 2 Context – Requirements Analysis (WP3) and Policies and Guidelines (WP4)

Strand 3 Design and Implementation – System Architecture (WP5), User Interface

Tools (WP6), Data Management (WP7) and Storage and Preservation (WP8)

Strand 4 Rollout - Support, Access & Development – User Support, Training and

Advocacy (WP9) and Demonstrator Projects (WP10)

While this report primarily focuses on the activities of Strand 3 and the Requirements

Analysis Work Package (WP3), the development of the DRI infrastructure was carried

out with input from all Strands and Work Packages as well as the various Task-forces

and Working Groups within the project.

In parallel to the development work on the Repository, the DRI has produced a number

of reports and guidelines on a range of topics including a review of digital archiving in

Ireland (2012), an analysis of international approaches to caring for digital content

(2013), guidelines for the use of metadata standards in the Repository (2015), fact sheets

on metadata quality control and long-term digital preservation, and specific legal documentation.⁴ The DRI’s reports, guidelines and fact sheets have influenced the

development of the Repository and constitute an important output from the DRI’s

Strands and Work Packages.

The DRI’s Demonstrator Projects (WP10) were designed to test various technical and

policy components of the Repository as well as to showcase some of the HSS datasets

which are managed by the consortium’s partner institutions. Five demonstrator projects

were included in the DRI project plan. Each project brought unique data types, copyright

challenges, access and metadata requirements and provided necessary context for the

development of the Repository.

The Clarke Studio Archive (TCD) consists of hundreds of drawings, notebooks and business records of the Harry Clarke stained glass studio. This collection consists of digitised text and images. Irish Lifetimes (MU) is a collection of oral and life histories that reflect the changing lives and times in Ireland from the foundation of the State. These life histories are captured in digital audio files together with transcriptions in text documents.

3 http://www.dri.ie/about, last accessed 6 June 2015.

4 http://www.dri.ie/publications, last accessed 6 June 2015.

Kilkenny Design Workshops Archive (NCAD) features a collection of digitised photographs, press cuttings and booklets recording the history of the Workshops and their

impact on design in Ireland. The Irish Language and Cultural Heritage Project (NUIG)

focuses on particular aspects of Irish cultural heritage, depicted through the medium of

the Irish language. The NUIG collections contain digitised texts, images, audio and video

which are drawn from a number of sources, including the University’s own Irish language

archives, as well as contributions by other external partners, including RTÉ Raidió na

Gaeltachta. The Letters of 1916 (MU) is a crowdsourced collection of letters written around the time of the Irish 1916 Easter Rising. These have now been uploaded, transcribed and encoded by members of the public.

1.2. DRI and OAIS

The project structure of DRI was influenced by the Reference Model for an Open Archival

Information System (OAIS). An OAIS consists “of an organization, which may be part of

a larger organization, of people and systems that has accepted the responsibility to preserve information and make it available for a designated community” (Consultative

Committee for Space Data Systems, 2012). The Reference Model establishes a common

framework of terms and concepts which constitute a digital repository and lays out the

functional components and responsibilities of such a repository at an organisational level.

The Strands and Work Packages of DRI can be mapped to the OAIS reference model as

shown in Figure 1.1. The reference model recognises the interaction of producers of

digital objects, management, the archive (or repository) itself and consumers of the

digital objects. The archive consists of administration and preservation planning as well

as functional entities: ingest, archival storage, data management and access.

Strand 1 is responsible for the long-term planning and project management for the

Repository. Preservation planning is undertaken across the Strands, in particular by the

System Architecture Work Package (WP5), the Data Management Work Package (WP7)

and the Storage and Preservation Work Package (WP8). The Demonstrator Projects Work

Package (WP10) are the producers of digital objects which are ingested into the

Repository. The archival storage is the responsibility of the Storage and Preservation

Work Package (WP8), while data management is the remit of the Data Management

Work Package (WP7). User interface design and development to facilitate access is the

responsibility of the User Interface Tools Work Package (WP6).
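
The mapping described above can also be written down as a small lookup table. The following is a minimal sketch that only restates the assignments given in the text and in Figure 1.1; the dictionary layout and the helper name `entities_for` are illustrative assumptions and are not taken from the DRI code base.

```python
# OAIS roles and functional entities mapped to DRI Strands and Work Packages,
# restating the assignments given in the text and in Figure 1.1.
# The structure and the helper below are hypothetical, for illustration only.
OAIS_TO_DRI = {
    "Producer": ["WP10 Demonstrator Projects"],
    "Administration": ["Strand 1 Management"],
    "Preservation Planning": [
        "WP5 System Architecture",
        "WP7 Data Management",
        "WP8 Storage and Preservation",
    ],
    "Ingest": ["WP10 Demonstrator Projects"],
    "Archival Storage": ["WP8 Storage and Preservation"],
    "Data Management": ["WP7 Data Management"],
    "Access": ["WP6 User Interface Tools"],
}


def entities_for(work_package):
    """Return the OAIS entities that a Work Package (e.g. "WP8") contributes to."""
    prefix = work_package + " "
    return [
        entity
        for entity, owners in OAIS_TO_DRI.items()
        if any(owner.startswith(prefix) for owner in owners)
    ]


print(entities_for("WP8"))
```

A lookup of this kind makes it easy to see that some Work Packages, such as WP8, contribute to more than one OAIS entity.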


Figure 1.1: A mapping of DRI Strands and Work Packages to the OAIS reference model

1.3. Development Approach

As DRI was a new project the team did not have to be concerned with maintaining

legacy systems or methodologies. The team adopted various agile techniques and

methods and repeated those that appeared to work for DRI.

The agile methodology is an iterative approach to software development which means

that requirements and solutions are not fixed at the start of the project, but rather evolve

throughout the life of the project. Agile approaches emphasise collaborative planning, rapid prototyping, early delivery of features and continuous improvement, and encourage a rapid and flexible response to change.⁵

DRI’s work plan established that the Requirements Analysis and Policies and Guidelines Work Packages, as well as the Design of Software Architecture Work Package, would begin work at the same time. This created a potential conflict between the need of the Requirements Manager to define authentic user-focused requirements and the software development team’s need to begin development of a technical solution in the absence of such requirements. However, some of DRI’s high-level business requirements were well defined from the outset of the project, which gave the software development team a starting point. Rapid prototyping was used to demonstrate features of the Repository software. This allowed feedback from the Repository’s stakeholders to inform both the requirements and development processes.

[Figure 1.1 labels: Producer (WP10); Administration (Strand 1); Preservation Planning (Strand 2 WP5, 7, 8); Ingest (WP10); Archival storage (WP8); Data management (WP7); Access (WP6); Consumer]

5 http://agilemanifesto.org/, last accessed 6 June 2015.

The use of an agile approach enabled the technical team to be responsive to changes

in the project schedule. For example, supporting external curation tools developed by

Inspiring Ireland⁶ required a re-prioritisation of features in order to develop the API functionality. Despite this change in focus, the disruption caused to the development effort

was minimal given the agile methodology used.

At each new iteration or sprint of development a subset of requirements were prioritised

by Strand 3 and Strand 2 at DRI’s monthly cross-strand meetings. This forum enabled

high-level feature specification to be carried out in collaboration with stakeholders and

the demonstrator projects.

Because the development team was distributed, with responsibilities for different Work Packages shared among multiple consortium partners, a daily “stand-up” meeting by teleconference was essential. This provided a mechanism for the team to communicate, maintain an efficient working relationship and ensure that issues which could affect schedules were identified and dealt with quickly.

Tools, including Git, were used to assist the technical team in working collaboratively to produce code in an effective manner. Git, a shared source code version control system, was deployed to allow easy forking and merging of the code bases. Periodic code

reviews were introduced to ensure quality and consistency of the code being developed

across the consortium partners.

1.4. Report Structure

This infrastructure report details the various stages of development undertaken by DRI.

It considers the work of the different Work Packages within Strand 3 as well as the

Strand 2 Work Package, Requirements Analysis. Chapter 2, Requirements, details the

requirements strategy developed by DRI and outlines the requirements gathering and

analysis processes carried out by DRI.

The architecture of the DRI infrastructure was designed to meet all of the requirements

of the project. This is outlined in Chapter 3, where both the system and application

structures are described. This chapter includes a general overview of the DRI architecture

as well as the architecture of the DRI application and the deployment architecture.

6 http://www.inspiring-ireland.ie/, last accessed 6 June 2015.

The core technology choices made to implement the Repository are then discussed.

These choices were informed by the requirements gathering process and the architecture

that DRI was aiming to implement. Key characteristics that influenced the decisions

included the scalability and extensibility of the solutions examined, as well as ease of

customisation, the system’s ability to add data types and metadata standards and the

range of storage options available.

The DRI approach to distributed and federated storage is addressed in Chapter 5. This includes a discussion of software defined storage (SDS) and solutions that were investigated by DRI. It also looks at the storage architecture, storage integrity and trust, and the migration of the entire storage solution, should this become necessary when the system reaches the limits of its scale or the end of its supported life.

As part of the DRI’s goal of building a TDR the underlying data management layer

needed to be capable of dealing with a wide variety of digital collections. At the same

time the data management layer was required to provide a set of data models that

allow for the preserved data and metadata to be accessed and analysed by new tools

developed in the future. Chapter 6 includes a discussion of the DRI data model and the

metadata standards supported by it and the solutions developed by the Data

Management Layer Work Package.

Chapter 7 outlines the design and implementation of the user interface. It also gives

insight into the technologies used to implement the architecture of the UI for the

Repository.


2. Requirements

The successful implementation of DRI’s fundamental aim, to build an interactive trusted

digital repository (TDR), is dependent upon the system’s ability to implement its stated

requirements and satisfy identified user needs. From the outset DRI committed to developing a system that reflects the authentic needs of its designated and diverse community

of users.

The methodology for developing DRI’s requirements was centred on a number of basic but important premises, namely that the requirements are:

• driven by and focused on the user.

• considered within the humanities and social science problem domain.

• iterative and require input and feedback from the software development team, the Policy and Guidelines Work Package and end users (through demonstrator projects).

• flexible and dynamic without leading to software drift or requirements creep.

These assertions are reflective of DRI’s mission to be user and community focused.

2.1. Requirements analysis and specification in DRI

In 2012 DRI released its first national report, ‘Digital Archiving in Ireland: National Survey

of the Humanities and Social Sciences’, which published findings and observations on a

number of issues including preservation, storage and formats, metadata and interoperability, as well as user tools and content management.⁷ This report was based on the

requirements interviews DRI carried out to inform and develop DRI’s requirements and

policy statements. It was the first analysis of these interviews and provided the means

to explore more thoroughly the issues, concerns and problems raised by DRI’s community.

The interviews which informed the 2012 report, and their analysis, reflect DRI’s requirements process for the period 2011-2013, the principal output of which was the formal specification of requirements which underpin, inform and prioritise the software development effort described in this report. DRI’s formalised requirements statements are the cumulative result of the elicitation, analysis and specification phases of the requirements process.

DRI’s requirements range from high-level business requirements to functional requirements that detail user interactions with the repository and its content. The first and most

important requirement, that the repository should be a TDR and should provide a range

of research tools, came directly from the project’s strategic objectives.

7 O’Carroll, A. and Webb, S., ‘Digital Archiving in Ireland: National Survey of the Humanities and Social Sciences’ (2012), available at http://dri.ie/digital-archiving-in-ireland-2012.pdf, last accessed 6 June 2015.

As well as being influenced by the OAIS reference model (see Section 1.2), DRI adopted the Data Seal of Approval (DSA, Data Seal of Approval Board, 2013) as a guideline for its policies, and also consulted the ISO 16363⁸ standard for Trusted Digital Repositories, which is based on the Trustworthy Repositories Audit & Certification (TRAC, Online Computer Library Center & Center for Research Libraries, 2007), for additional guidance.

These factors introduced a number of requirements specifically related to the provision

of a TDR, such as the requirement to generate checksums of ingested files, to have

backup and disaster recovery processes in place, and to generate audit reports.
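
As an illustration of the checksum requirement, the sketch below computes a digest when a file is ingested and re-checks it later. It is a minimal example under stated assumptions: the report does not say which algorithm or code DRI uses, so SHA-256 and the function names `file_checksum` and `verify_fixity` are hypothetical choices made here.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a hex digest of a file, reading it in chunks to bound memory use.

    SHA-256 is an assumed default; the source does not name an algorithm.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="sha256"):
    """Return True if the file still matches the checksum recorded at ingest."""
    return file_checksum(path, algorithm) == expected


# Small demonstration with a temporary file standing in for an ingested asset.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "object.bin"
    sample.write_bytes(b"example payload")
    recorded = file_checksum(sample)           # stored alongside the object at ingest
    intact = verify_fixity(sample, recorded)   # later audit: file unchanged
    sample.write_bytes(b"tampered payload")
    corruption_detected = not verify_fixity(sample, recorded)

print(intact, corruption_detected)
```

Recording the digest at ingest and recomputing it during audits is what allows a repository to report on the integrity of its holdings over time.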

Requirements were informed by the legislative framework in which the repository would

operate. These included requirements related to the EU Cookie Directive (European

Parliament, Council of the European Union, 2009), Copyright and Intellectual Property

law (Copyright and Related Rights Act, 2000) and support for Freedom of Information

requests (Freedom of Information Act, 1997).

As part of the requirements gathering process a review of emerging approaches to

caring for digital data (O’Carroll, Collins, Gallagher, Tang & Webb, 2013) was performed

to examine national and international best practices in the field. This informed not only

requirements and policy, but also the technical and software decisions discussed in the

following chapters of this report.

As stated, DRI is committed to satisfying authentic user needs. In order to achieve this

it was necessary to determine the target community and identify the different actors

and users of the system (e.g. depositor, collection manager, expert user, novice user,

public user, third level researcher, external repositories, etc.). Different stakeholder personas were developed through stakeholder and user group analysis as well as from

interviews, resulting in the development of use cases and use case scenarios. DRI’s

demonstrator projects also provided an excellent opportunity for the requirements, Policy

and Guidelines and development Work Packages to engage with authentic users of the

system. This process generated open discussion on user needs and revealed important

user and functional requirements related to basic operations that users expected to be

able to perform on the repository. Ingestion, editing, creating collections, publishing

and others are examples of such functional requirements.

In addition to requirements that mapped to specific activities within the repository, there

were non-functional requirements, such as the requirements for the repository to be

robust, scalable and secure. The DRI storage and performance metrics survey helped to

identify the typical storage and performance requirements of our stakeholders.

8 http://www.iso.org/iso/catalogue_detail.htm?csnumber=56510, last accessed 6 June 2015.


Performance testing of the system then allowed the project to measure success in

meeting these requirements.

In order to be successful any software product must consider usability issues. These introduced further requirements such as the need to support common browsers, to be

accessible, and to work on multiple devices (PC, smartphone, tablet, etc.).

Some requirements related specifically to DRI’s intention to interact with Irish and EU projects; these included a REST Application Programming Interface (API) to support harvesting of digital objects from DRI.

2.2. Requirements management in DRI

The overall requirements engineering process is iterative and entails ‘multiple cycles’ between requirements elicitation, analysis, specification and, of course, validation (Wiegers, 2006, p. 9). The use of a central requirements management system (RMS), CaseComplete⁹, supported this iterative model and ensured that the requirements were accessible across the Strands. The RMS also assisted transparency and traceability within the requirements process, an important feature given the nature and scope of DRI.

DRI’s RMS was managed, updated and populated by DRI’s Requirements Analyst and

subsequently the Requirements Manager. This ensured continuity and was essential to

the requirements management strategy. Contributions, feedback and comments on

requirements, from the various team members in all Strands, were also an essential

feature of the iterative requirements process. Regular requirements reviews and updates

were given at cross-strand meetings and in the last year of development a bimonthly

review was carried out with members of the development team to assess the implementation status of the functional requirements. A change history was documented within each requirement statement in the system. No requirements were deleted; instead, redundant requirements were tagged as obsolete and remained in the system for

continuity, demonstrating the evolution of the requirements and the Repository.
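
The practice of keeping a change history and tagging requirements as obsolete rather than deleting them can be sketched as a small data structure. This is not how CaseComplete stores requirements; the class names, fields and the sample identifiers below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Requirement:
    """A requirement statement that carries its own change history."""
    identifier: str
    statement: str
    status: str = "active"
    history: list = field(default_factory=list)

    def revise(self, new_statement, note, when=None):
        # Keep the previous wording so the evolution stays traceable.
        self.history.append((when or date.today(), self.statement, note))
        self.statement = new_statement

    def mark_obsolete(self, note, when=None):
        # Obsolete requirements are tagged, never removed.
        self.history.append((when or date.today(), self.statement, note))
        self.status = "obsolete"


class RequirementsRegister:
    """A register that deliberately offers no delete operation."""

    def __init__(self):
        self._items = {}

    def add(self, requirement):
        self._items[requirement.identifier] = requirement

    def active(self):
        return [r for r in self._items.values() if r.status == "active"]

    def all(self):
        return list(self._items.values())


# Hypothetical sample data, not taken from DRI's requirements.
register = RequirementsRegister()
checksums = Requirement("REQ-001", "Generate checksums of ingested files.")
legacy = Requirement("REQ-002", "Support a placeholder legacy feature.")
register.add(checksums)
register.add(legacy)
legacy.mark_obsolete("Superseded after a cross-strand review.")

print(len(register.all()), len(register.active()))
```

Because nothing is removed, the full register continues to show how the requirements evolved, while `active()` gives the development team the current working set.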

2.3. Requirements implementation

In order to ensure that the development work proceeded directly from the project

requirements, each high-level requirement had to map to a set of features that could

be implemented within the repository application.

9 http://casecomplete.com/, last accessed 6 June 2015.

The requirements were thus translated into executable requirements or specifications. Described as ‘outside-in development’ (Wynne & Hellesøy, 2012, p. 18), these executable specifications were composed of step-by-step descriptions of how tasks would be completed within the Repository. The executable specifications formed not only a detailed

specification of the Repository’s functionality, but they also defined the acceptance tests

for the system. Acceptance tests are expressions of ‘what the software needs to do in

order for the stakeholder to find it [the system] acceptable’ (Wynne & Hellesøy, 2012,

p. 4). This is part of the behaviour driven development (BDD, Chelimsky et al., 2010) approach to software development that focuses on cooperation between the stakeholder and the software development team. BDD is part of the agile software development methodology.

The Cucumber¹⁰ framework was used to translate the project requirements into the executable specifications and acceptance tests for the developers in DRI. An example of such a Cucumber feature file is given in Appendix 1. The Cucumber feature files were then augmented with RSpec¹¹ ‘step definitions’ that translated the human-readable

Cucumber feature specification into an executable test that could be run to validate the

feature. The combination of a testing framework and continuous testing meant that

the team was able to change code without fear of introducing conflicting or defective

changes.
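The relationship between a human-readable specification and its step definitions can be sketched in a few lines. The following is a toy analogue written in Python, not the Cucumber or RSpec API: the `step` decorator, the `run_scenario` function and the scenario wording are all hypothetical names chosen for illustration, and the real DRI feature file is the one given in Appendix 1.

```python
import re

# Registry of (compiled pattern, handler) pairs; these play the role of
# step definitions in this toy analogue.
STEPS = []


def step(pattern):
    """Register a function as the handler for steps matching `pattern`."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator


@step(r'an empty collection called "(.+)"')
def create_collection(ctx, name):
    ctx["collections"] = {name: []}


@step(r'I add the object "(.+)" to "(.+)"')
def add_object(ctx, title, name):
    ctx["collections"][name].append(title)


@step(r'"(.+)" should contain (\d+) objects?')
def check_count(ctx, name, count):
    assert len(ctx["collections"][name]) == int(count)


def run_scenario(text):
    """Run each Given/When/Then line against the registered steps.

    Returns the number of steps executed; raises LookupError when a line
    has no matching step definition.
    """
    ctx = {}
    executed = 0
    for line in text.strip().splitlines():
        body = re.sub(r"^(Given|When|Then|And)\s+", "", line.strip())
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(ctx, *match.groups())
                executed += 1
                break
        else:
            raise LookupError(f"no step definition for: {body}")
    return executed


# Hypothetical scenario text, written before any repository code exists.
SCENARIO = '''
Given an empty collection called "Letters"
When I add the object "Letter 1" to "Letters"
Then "Letters" should contain 1 object
'''
```

Calling `run_scenario(SCENARIO)` executes the three steps in order. A step with no matching definition raises an error, which mirrors how an unimplemented requirement shows up as a failing acceptance test.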

The Cucumber features were written, and the expected inputs and outputs specified, before the function or method being tested was developed. This approach allowed the developers to view the methods and functions from the user’s point of view, that is, from the vantage point of a user’s tasks and goals rather than from the developer’s solution-based perspective. It also ensured that the technical team was delivering on end-user requirements.


10 http://cukes.info/, last accessed 6 June 2015.
11 http://rspec.info/, last accessed 6 June 2015.

3. Architecture of the System

The architecture of the DRI infrastructure was designed to meet the requirements of

the Repository as outlined in Chapter 2. As a Trusted Digital Repository (TDR) the infrastructure must ensure reliable, long-term preservation of digital objects. However, as a key aim of the Repository is also the accessibility of content, it is necessary to provide a comprehensive, fast and easy-to-use user interface with appropriate research tools.

While frameworks adopted by the DRI, such as OAIS (see Section 1.1), provide guidance

on the responsibilities and functions of a TDR, they do not specify any technical solu-

tions, acting only as a high-level guide to some of the types of services that should be

included.

3.1. General overview of the DRI architecture

The DRI infrastructure is broken into several different components. A storage cluster

(see Chapter 5) provides storage and preservation services for the digital objects, as well

as storage for virtual machine images, documentation and other shared data. The

storage is distributed across multiple geographically separate sites to provide failover

and disaster recovery.

A number of physical servers act as virtual machine hosts running a virtualised environ-

ment. These are also located across multiple sites. Virtual servers running within the

virtual environment host the various services required to deliver the Repository func-

tionality. These services include application servers, databases, search and repository

services, authentication services and storage access services. The majority of these serv-

ices are replicated within the virtual environment for scalability and failover purposes.

Load balancers provide for scalability and high-availability both within a single site and

across multiple sites. The entire infrastructure is replicated in a secondary site to allow

for failover and disaster recovery of the whole DRI system.

The architecture of the DRI Infrastructure is shown in Figure 3.1.


Figure 3.1: Major components of the DRI Trusted Digital Repository

[Figure labels: Web Load Balancer 1, Web Load Balancer 2, S3 Gateway, Passenger/Apache App, Database server, Solr, Fedora, LDAP, Shibboleth, Virtual Environment, Physical VM Servers, Storage Cluster]

3.2. Architecture of the DRI application

The Model-View-Controller (MVC) pattern was chosen to implement the Repository.

MVC separates the presentation of data in the system from the back-end representation

and storage of the data. This separation allows for independent components to provide

the required functionality of the system and suited the distribution of work among the

project’s Work Packages. It also makes it easier to replace components without disrupting

the rest of the system and allows for the provision of multiple components with, for

example, different views of the same data for different user types or use cases.

The Model components contain a description of the DRI content model which can be

instantiated by the controllers, and which is specific to DRI’s needs. The models provide

a set of rules for the digital objects and collections outlining the required metadata fields

and various mappings of formats and standards into the DRI content model. This mod-

elling work is one of the primary tasks of the Data Management Layer Work Package

(see Chapter 6). Other models describe users, institutes, licences and other entities within

the system.

The View components in the DRI application are primarily the discovery and administra-

tive interfaces. These are designed and implemented mainly in the User Interface Tools

Work Package. The DRI repository also exposes an API for programmatic access. The API

follows the Representational State Transfer (REST) design principle, where the HTTP

methods, POST, GET, PUT, and DELETE, are explicitly mapped to the create, read, update,

and delete (CRUD) operations of the Repository.
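The mapping of HTTP methods onto CRUD operations can be illustrated with a small dispatcher. This is a minimal sketch in Python rather than the Repository's Ruby on Rails code: the `ObjectStore` class, the `handle` function and the status codes returned are assumptions made for illustration and do not describe the actual DRI API.

```python
# Mapping stated in the text: HTTP method -> CRUD operation.
CRUD_BY_METHOD = {"POST": "create", "GET": "read", "PUT": "update", "DELETE": "delete"}


class ObjectStore:
    """Hypothetical in-memory stand-in for the Repository's object storage."""

    def __init__(self):
        self._objects = {}
        self._next_id = 1

    def create(self, object_id, body):
        # The store assigns the identifier; any client-supplied id is ignored.
        new_id = str(self._next_id)
        self._next_id += 1
        self._objects[new_id] = body
        return 201, {"id": new_id}

    def read(self, object_id, body):
        if object_id not in self._objects:
            return 404, {"error": "not found"}
        return 200, self._objects[object_id]

    def update(self, object_id, body):
        if object_id not in self._objects:
            return 404, {"error": "not found"}
        self._objects[object_id] = body
        return 200, body

    def delete(self, object_id, body):
        if object_id not in self._objects:
            return 404, {"error": "not found"}
        del self._objects[object_id]
        return 204, None


def handle(store, method, object_id=None, body=None):
    """Dispatch a request to the CRUD operation mapped to its HTTP method."""
    operation = CRUD_BY_METHOD.get(method)
    if operation is None:
        return 405, {"error": "method not allowed"}
    return getattr(store, operation)(object_id, body)
```

Keeping the method-to-operation table explicit means that an unsupported method is rejected in one place, before any storage code runs.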

The Controller components contain the business logic which governs how actors interact

with digital objects within the TDR. These interactions are defined by requirements and

policy and are implemented between the User Interface Tools and Data Management

Layer Work Packages as appropriate.

The application stack is illustrated in Figure 3.2 while a detailed class diagram for the

application is given in Appendix 2.


Figure 3.2: Architecture of the DRI Trusted Digital Repository Application

[Figure labels: Browser/Client App (Web Browser, REST API Client App); Custom and overridden Controllers; Custom and overridden Views; Blacklight Controllers; Blacklight Views; External Apps and Services; DRI User Group Gem (authentication); Hydra Access Controls; DRI Data Models; Custom Models (User Model, Licence, Asset File, Institute); Blacklight Search & Facet logic; Solrizer; Rubydora; Active Fedora; OM; Shibboleth; LDAP; MySQL; S3 Gateway; Fedora; Solr; Tomcat server; Ruby on Rails Server; OpenNebula Virtualised Environment; Storage Cluster]

3.3. Deployment architecture

In order to isolate the various services required to deliver a TDR, each service is deployed

into its own virtual machine within the virtual environment. By isolating each service in

this way operational or security issues that need to be addressed for a particular service,

for example upgrading a service or applying a security patch, will not affect other TDR

services.

A private cloud is used to deploy the DRI application together with the supporting com-

ponents and databases. This system is based on OpenNebula12, which provides a

scalable, feature-rich virtualisation stack for Infrastructure as a Service (IaaS) installations

(Mell & Grance, 2011). The storage infrastructure does not run within OpenNebula for

performance, reliability and robustness reasons.

DRI aimed to ensure that the deployment process used to install and configure its hosts

and software is robust and repeatable. A robust and repeatable deployment process

allows the DRI systems engineers to deploy multiple replicas of any given service for

backup or scalability purposes. It also ensures a rapid redeployment in the case of cata-

strophic failure. To this end the testing and deployment process of all system components

is fully automated. The physical and virtual hosts are configured using Ansible13, which is

an agentless configuration management, application deployment and automation tool.

DRI also uses a continuous build and test system to ensure that all new code changes

that are committed to the software repository are automatically tested. In addition, based

on configuration options, a new version of the codebase can be deployed automatically

if these tests pass. DRI uses Buildbot14, an open source application that runs repeated

jobs, such as building applications, to run tests and generate reports. This automation

of building and testing components increases the team’s productivity by providing regular

feedback to the developers.


12 http://opennebula.org/, last accessed 6 June 2015.
13 http://www.ansibleworks.com/, last accessed 6 June 2015.
14 http://www.buildbot.net/, last accessed 6 June 2015.

4. Core Technology Choices

In order to implement the DRI infrastructure a number of technology choices were made.

These choices were informed by the requirements gathering process and the architecture

that the DRI was aiming to implement. Key characteristics based on the DRI’s require-

ments that were pivotal in choosing a repository system included:

Scalability and extensibility

Ease of customisation

Ability to add data types and metadata standards

Range of storage options

Due to the large scope of the project, emphasis was put on scalability and how extensible

each software or storage solution was. In terms of the Repository software two possible

approaches were considered: the team could either enhance and customise an existing

repository solution to meet the specific needs of the DRI, or build a new bespoke solution

by combining existing search and repository technologies.

4.1. Repository solutions

A number of existing repository solutions were considered. These mostly aim to provide

an ‘out of the box’ solution, with minimal install and configuration necessary. Examples

include commercial systems, such as CONTENTdm15, originally developed by the University of Washington (Zick, 2009), and DigiTool16 from Ex Libris, as well as open-source

systems such as DSpace17 which is developed under the stewardship of DuraSpace and

EPrints18 created by the University of Southampton.

These systems are capable of scaling to accommodate large collections as would be

required by DRI. Most also offer some level of extensibility and customisation; however, this tends to be more focused on the User Interface (UI) than on services and functions such as ingest and storage. Additionally, the long-term continued availability and

support of commercial, closed source products is not guaranteed. While this is also true

for open source systems, the availability of the source code means that it may be possible

to continue development of such systems in-house.


15 http://www.contentdm.org, last accessed 6 June 2015.
16 http://www.exlibrisgroup.com/category/DigiToolOverview, last accessed 6 June 2015.
17 http://www.dspace.org, last accessed 6 June 2015.
18 http://www.eprints.org/, last accessed 6 June 2015.

4.2. Repository framework

Given the limitations described in Section 4.1, together with the complicated workflows

required, the mix of user types expected and the number of metadata standards to be

supported, building a new solution was the preferred option. Three basic components

were required for the Repository: a system for storing and managing ingested digital

objects; a search technology for discovery; and an application providing user browsing,

ingest and management functions.

Although a custom solution was seen as the best way of meeting the requirements,

building the system from basic components would have required significant develop-

ment effort. Given the resources available an alternative, hybrid, approach was found,

whereby a framework that occupies a middle-ground between a full repository solution

and the basic components of a system was utilised. The framework chosen is called

Hydra19.

The risk involved with depending on commercial software, particularly for long-term

preservation activities, has already been mentioned. A similar risk could be said to exist

with open-source software, such as Hydra, where long-term continued availability and

support of the product is not guaranteed. This is somewhat mitigated by Hydra’s strong community involvement, both in the core development activities and in its adoption and

usage. Many large institutions have committed to supporting the project as Hydra part-

ners, including Yale University, Virginia Tech and Princeton University Library. The National

Library of Ireland and University College Dublin are also working with Hydra and its com-

ponents.

By actively participating in this community the DRI has gained from both the experience

of the community and pre-existing state of the art. Re-using tools and libraries con-

tributed by others, such as Sufia20 and Hydra Derivatives21, allowed the development of

the Repository to proceed at a much faster pace than would have been possible had

the system been developed independently.

Project Hydra is a multi-institutional collaboration steered by Stanford University,

University of Hull, University of Virginia, Duraspace and Data Curation Experts. The Hydra

technical framework provides an ‘ecosystem of components’22 on top of which it is pos-

sible to construct a full-featured repository application. Together these components allow

for the Create, Read, Update, and Delete (CRUD) operations required by a repository.


19 http://projecthydra.org/, last accessed 6 June 2015.
20 https://github.com/projecthydra/sufia, last accessed 6 June 2015.
21 https://github.com/projecthydra-labs/hydra-derivatives, last accessed 6 June 2015.
22 http://projecthydra.org/, last accessed 6 June 2015.

The primary platforms of the Hydra framework that support these operations are

Fedora23, Solr24 and Blacklight25, together with the core Hydra libraries26. These are described

in Section 4.3.

4.3. Repository components

The Hydra framework provides a set of libraries (known as Ruby gems27) that can be

built on to implement a complete repository solution. Applications can be developed

rapidly on top of the Hydra gems using the open-source web framework Ruby on Rails28.

Ruby on Rails is a very scalable development platform. It has been used to build large

high-traffic websites with great success, for example, Twitter and GitHub29. Together, the gems produced by Project Hydra are intended to significantly reduce the need to

spend development resources on constructing the framework of a repository, instead

allowing developers to focus on developing data models and other services.

Applications built in this way are known as ‘Hydra Heads’, where Fedora together with

Solr represent the ‘body’ of a Hydra repository, i.e. the content, while the ‘head’ is the

application layer that connects users to the ‘body’. It is possible to have several Hydra

Heads connected to one body, for example one application for normal users (the con-

sumers of digital objects) and a separate management application for depositors (the

producers of digital objects).

Flexible Extensible Digital Object Repository Architecture (Fedora) is open source software

designed as a core repository service for the storage, management and access of digital

objects. It provides SOAP and REST application programming interfaces (APIs) for soft-

ware development. It is designed to work with any software architecture and to integrate

with a large variety of storage solutions, from common SQL databases to advanced low-level storage solutions like iRODS, Amazon S3 and SRB. By default it provides Dublin

Core metadata for each digital object, but it allows developers to add any metadata

format to a digital object. Fedora can integrate with other software services using its

Content Model system for describing object types (Davis and Wilper, 2011).

Solr is a Java-based search engine from the Apache Software Foundation. It uses the

Apache Lucene30 search engine library in its core indexing and searching algorithms.


23 http://www.fedora-commons.org, last accessed 6 June 2015.
24 http://lucene.apache.org/solr/, last accessed 6 June 2015.
25 http://projectblacklight.org/, last accessed 6 June 2015.
26 https://github.com/projecthydra/hydra-head, last accessed 6 June 2015.
27 http://guides.rubygems.org/what-is-a-gem/, last accessed 6 June 2015.
28 http://rubyonrails.org/, last accessed 6 June 2015.
29 http://rubyonrails.org/, last accessed 6 June 2015.
30 http://lucene.apache.org, last accessed 6 June 2015.

Lucene and Solr share the same development team. Lucene provides Solr with advanced text analysis and search capabilities. This allows the software to be configured to perform complex searches involving boolean expressions, wildcards, date ranges, text ranges, ‘sounds like’ text matching and geospatial searching. Out of the box, Solr provides highlighted snippets, multi-select facet filtering, spell checking and auto-suggest, which offers the user suggestions on how to complete a query31.

Solr can also automatically extract and index the internal text and metadata of common

proprietary data formats such as PDF and Microsoft Office documents using Apache

Tika32.
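As an illustration of the search features described above, the following Python sketch assembles a Solr select request that combines a boolean and wildcard query, a range filter, faceting, highlighting and spell checking. The request parameters (`q`, `fq`, `facet`, `facet.field`, `hl`, `hl.fl`, `spellcheck`, `wt`) are standard Solr parameters, but the base URL and the field names `title_t`, `year_i` and `creator_s` are hypothetical and are not taken from the DRI schema. Spell checking also assumes that a spellcheck component has been configured on the Solr side.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, year_from, year_to, facet_field):
    """Build a Solr /select URL (no request is sent)."""
    params = [
        ("q", text),                                      # boolean and wildcard query
        ("fq", f"year_i:[{year_from} TO {year_to}]"),     # range filter
        ("facet", "true"),
        ("facet.field", facet_field),                     # facet counts for this field
        ("hl", "true"),
        ("hl.fl", "title_t"),                             # field to highlight
        ("spellcheck", "true"),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical usage: titles starting with "irel" that also mention "history".
url = build_solr_query(
    "http://localhost:8983/solr/dri",
    "title_t:(irel* AND history)",
    1900,
    1950,
    "creator_s",
)
```

In a Hydra application these parameters would normally be generated by Blacklight rather than built by hand; the sketch only shows what the underlying request could look like.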

Hydra uses Blacklight, a discovery framework, to provide a standard user interface that

allows for search and display of objects stored in the repository. Blacklight provides fea-

tures such as faceted-search and browsing, but is also highly customisable. There are

many examples of interfaces built using Blacklight, for example, The Rock and Roll Hall

of Fame33 and the Stanford University Library Catalog34.


31 http://lucene.apache.org/solr/features.html, last accessed 6 June 2015.
32 http://tika.apache.org/, last accessed 6 June 2015.
33 http://catalog.rockhall.com/, last accessed 6 June 2015.
34 http://searchworks.stanford.edu/, last accessed 6 June 2015.

5. Storage

To achieve high levels of trust and preservation in the Repository it was necessary to

build a storage infrastructure that could satisfy DRI’s requirements. These included the

need to have independent, replicated and fault tolerant storage pools, geo-replicated

copies, tape cold storage and integrity checks. The foundation of this is making the

correct technology choices. A full preservation strategy will be published separately to

this report.

5.1. Storage solutions

In order to choose the most appropriate storage for the DRI, candidate storage solutions were tested. Testing consisted of full hardware installation, adding and removing storage, high availability and feature evaluation. Reliability, scalability, federation and interoperability were the desired requirements. Specific features such as data distribution, replication, scrubbing/self-healing and complete APIs are useful in providing ongoing, highly-available access.

The DRI decided to focus on software defined storage (SDS) solutions because of the

cost, flexibility and sustainability they can provide. Software defined storage is a form

of storage virtualisation whereby the underlying physical hardware (servers and disks) is

separated from the software that manages the storage. The SDS software provides

access to the storage hardware through standardised APIs and interfaces such as POSIX

shared filesystems, REST/S3 compatible gateways and/or network block devices.

Four SDS solutions were investigated: GPFS35, Hadoop HDFS36, iRODS37 and Ceph38. All

of these run on commodity servers, removing vendor lock-in and thus the team was free

to choose any hardware from any manufacturer. These SDS solutions are available for

many Linux distributions and some run on a Windows server which increases flexibility.

All are highly scalable and are established, well maintained projects that are supported

by large corporations that use and contribute to them.

Of the four solutions evaluated Ceph was deemed to be the most suitable fit for DRI’s

needs. It offers the best and most complete selection of interfaces and APIs: a REST/S3 compatible gateway, network block devices and a POSIX filesystem, as well as a rich array of libraries and APIs for many programming languages, as identified from the storage evaluation. It offers total data reliability with excellent data distribution, replication and self


35 http://www.ibm.com/systems/software/gpfs, last accessed 6 June 2015.
36 http://hadoop.apache.org, last accessed 6 June 2015.
37 http://irods.org, last accessed 6 June 2015.
38 http://ceph.com, last accessed 6 June 2015.

repair. It provides petabyte scalability with no single points of failure. It is complete; all

components are developed and maintained by the Ceph project, no external services or

software are required. Finally, it is free and open source, well maintained and is backed

by the Red Hat corporation, who are committed to its long term success.

5.2. Storage architecture

Two technologies are used to manage storage: Ceph and Bareos39. Ceph provides DRI

with cloud-like, storage-as-a-service. It is used for hot or live data storage for virtual

machine (VM) images and repository data. Bareos is used for cold tape storage and backup.

Figure 5.1: DRI storage architecture


39 http://www.bareos.org/en, last accessed 6 June 2015.
40 http://ceph.com/docs/master/rbd/rbd/, last accessed 6 June 2015.
41 http://ceph.com/docs/master/cephfs/, last accessed 6 June 2015.
42 http://ceph.com/docs/master/radosgw/, last accessed 6 June 2015.

At its lowest level, Ceph provides storage to the virtual environment, OpenNebula (see

Section 3.3), as a place to store VM disk images over the Rados Block Device40 (RBD)

interface. The repository application uses the Ceph shared filesystem, CephFS41, to store

archival objects while the discovery user interface uses the REST/S3 compatible Rados

Gateway (RadosGW42) to deliver surrogates of these archival objects to users.

[Figure 5.1 labels: Fedora Commons, MySQL, Solr, Virtual Environment, DRI Application, Access Surrogates, Repository Objects, VM Images, HAProxy, CephFS, RBD, RadosGW, disks]

Figure 5.2: Archival storage workflow

[Figure labels: Ingest, AIP, Site 1: TCD (Ceph, disks), Site 2: MU (Ceph, disks), Bareos, Offsite Datacenter, Offline Tape Archive]

As a function of preservation, the archival object workflow is the most important aspect

of the Repository’s storage solution. Data is first saved to disk in site one (TCD) and then replicated to site two (MU) before being backed up offsite, first to disk and then to tape.

On ingest the Archival Information Package (AIP) is created and stored in a separate

archive pool over CephFS. The AIP is “an information package that is used to transmit

archival objects into a digital archival system, store the objects within the system, and

transmit objects from the system” (Consultative Committee for Space Data Systems

(CCSDS), 2012). This is then replicated to a second, geographically separate, data centre

over secure channels. Regularly, using Bareos, this is backed up to cold storage located

in a third geographically separate data centre; first to disk and then to tape.

The AIP is structured on the MOAB format (Anderson, 2013). This format contains ver-

sioning, full metadata and checksum manifests. The benefit of versioning at the application

layer, as opposed to lower down in the storage layer, is simplified storage management

and layout, efficiency of tape archival, reduction or elimination of duplication in storage,

visibility of versions to the user and ease of recovery in the event of asset integrity failure.
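The way application-layer versioning can reduce duplication may be easier to see in a small sketch. The Python class below keeps one manifest per version and stores each distinct file content only once, keyed by its checksum. It is an illustrative model only: it does not reproduce the MOAB on-disk layout, and the class name, method names and the choice of SHA-256 are assumptions.

```python
import hashlib


class VersionedObject:
    """Toy model of a versioned archival object with de-duplicated content."""

    def __init__(self):
        self.blobs = {}      # checksum -> bytes, each distinct content stored once
        self.versions = []   # one {path: checksum} manifest per version

    def add_version(self, files):
        """Record a new version from a {path: bytes} mapping; return its number."""
        manifest = {}
        for path, content in files.items():
            checksum = hashlib.sha256(content).hexdigest()
            # Unchanged files reuse the blob stored by an earlier version.
            self.blobs.setdefault(checksum, content)
            manifest[path] = checksum
        self.versions.append(manifest)
        return len(self.versions)

    def read(self, version, path):
        """Return the content of `path` as it was in the given version (1-based)."""
        return self.blobs[self.versions[version - 1][path]]
```

If a second version changes only the metadata file, the unchanged asset is not stored again, while every earlier version remains readable through its own manifest.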

5.3. Storage integrity and trust

Ceph has very high levels of data protection. It has its own in-built checksumming and

regularly checks and automatically fixes any errors it finds in data. It is considered “self-

healing”43. Bareos also keeps checksums and upon backing up of digital objects, will

run checks to see if any errors have occurred, either on disk or on tapes.

A separate process, independent of Ceph and Bareos, checks the integrity of stored

archival objects. This process opens the AIP, recalculates checksums and compares

against the AIP manifest. It does this for all replicas of the AIP.
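A check of this kind could look roughly like the following Python sketch, which recalculates a checksum for every file listed in a manifest and repeats the comparison for each replica. The manifest is modelled here as a simple mapping of relative path to SHA-256 digest. Both the manifest structure and the choice of algorithm are assumptions made for illustration, as this report does not specify them, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_replica(aip_dir, manifest):
    """Return the relative paths that are missing or whose checksum differs."""
    failures = []
    for relative_path, expected in manifest.items():
        target = Path(aip_dir) / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures


def verify_all_replicas(replica_dirs, manifest):
    """Check every replica of the AIP against the same manifest."""
    return {str(replica): verify_replica(replica, manifest) for replica in replica_dirs}
```

An empty list for a replica means that every file matched the manifest. Any non-empty list identifies the files that would need to be restored from another replica.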

5.4. Storage migration

Migration of the entire storage solution could become necessary if the system reaches

the limits of its scale, or the end of its supported life (for hardware or software compo-

nents). In such a case it would be necessary to move the data to an upgraded or other

external system. A data migration plan was developed to address this. The plan outlines

the steps required to move data as seamlessly as possible off the current system. Data

consistency checks are part of this process. As the system uses open standards and pro-

tocols, the storage layer can be changed or replaced without affecting the software

stack. This allows for easier transfer of data to another storage solution should the need

arise. Migration to commercial storage solutions is also possible.


43 http://ceph.com/docs/master/architecture/#rebalancing, last accessed 6 June 2015.

6. Metadata and Data Modelling

Another important aspect in achieving the DRI’s goal of building a TDR is that the under-

lying data management layer must be capable of dealing with a wide variety of digital

collections. At the same time the data management layer requires a set of data models

that allow preserved data and metadata to be accessed and analysed by new tools devel-

oped in the future. Supporting such a diversity of collections is challenging; individual

collections are prepared and catalogued differently using metadata standards most suit-

able for that type of collection. Some of the most commonly used standards include

Dublin Core44 (DC) for academic, scholarly digital collections; Encoded Archival

Description45 (EAD) for archives; and Machine-Readable Cataloging46 (MARC), and

Metadata Object Description Schema47 (MODS) for libraries (O’Carroll and Webb, 2012).

The use of appropriate data documentation frameworks (metadata standards) is key to data dissemination, and also affects the discoverability and searchability of datasets. It is also important that a repository’s data model captures and represents rich

contextual information to enable effective data discovery, sharing and reuse. The deci-

sions on how data and relationships are represented in a digital repository are critical,

and also influence how users interact with the repository.

In this respect, two key requirements have driven the core design of the data manage-

ment layer. The first requirement was to support the most commonly used data, and

metadata, formats and standards. The second key requirement was to support effective

search functionality across collections. In this respect, the DRI data models perform meta-

data and data indexing, with the support of Apache Solr, as discussed in Section 4.3.

Moreover, the data models also incorporate data relationship management to allow for a meaningful presentation of data to the end user, and to enhance the data with contextual and linking information that facilitates browsing across collections.

6.1. Data models overview

Data and metadata are stored in the underlying digital repository as collections of Fedora

digital objects (Lagoze et al., 2005) that can be accessed by Hydra. As the user interface

is a customised Hydra head, and to facilitate the creation, management and storage of

digital objects, the data models are implemented as a Ruby gem that makes use of

ActiveFedora48, a Ruby gem for creating and managing objects in the Fedora Repository


44 http://dublincore.org/documents/dces/, last accessed 6 June 2015.
45 http://www.loc.gov/ead/, last accessed 6 June 2015.
46 http://www.loc.gov/marc/, last accessed 6 June 2015.
47 http://www.loc.gov/standards/mods/, last accessed 6 June 2015.
48 https://github.com/projecthydra/active_fedora, last accessed 6 June 2015.

Architecture (Project Hydra, 2009). Implementing the data models as extensions of

ActiveFedora presents numerous advantages. Firstly, it provides a robust framework for

working with Fedora digital objects, negating the need to implement such management

functionalities from scratch. Secondly, ActiveFedora makes use of two key components

that provide both XML-based metadata management and integration with Apache Solr.

The first of these components is the Opinionated Metadata (OM) Ruby gem (Project

Hydra, 2010a); the second is the Solrizer gem (Project Hydra, 2010b). These are both

described in more detail below.

Each of the supported metadata standards is implemented as a set of classes which

extend from the ActiveFedora::Base class. This allows modelling of Fedora digital objects

as Ruby classes that incorporate a set of Fedora datastreams to handle the different

types of metadata required by the system. Examples of metadata types include descrip-

tive metadata, technical metadata and DRI administrative metadata, which incorporates

preservation metadata.

ActiveFedora also facilitates the specification of relationships between digital objects as

Rails associations (Project Hydra, 2012). These relationships are saved using the RDF

(Resource Description Framework49) specification in a special datastream, RELS-EXT. This

allows for easier access and management of the relationships between stored digital

objects, as well as exposing this relational information to third party applications.

Moreover, it sets the basis for future extensions of the data models, for example sup-

porting Linked Data.
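To give a sense of what such a relationship looks like once serialised, the Python sketch below builds a minimal RELS-EXT style RDF/XML document stating that one object is part of another. The RDF namespace and the Fedora external relations namespace are the standard ones, but the function name and the identifiers `dri:obj1` and `dri:col1` are hypothetical. In the Repository itself this serialisation is produced by ActiveFedora rather than written by hand.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FEDORA_REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(object_pid, collection_pid):
    """Serialise an isPartOf relationship between two objects as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", FEDORA_REL_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the object that owns the datastream.
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{object_pid}"},
    )
    # The predicate is isPartOf; the object of the statement is the collection.
    ET.SubElement(
        description,
        f"{{{FEDORA_REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{collection_pid}"},
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical identifiers, for illustration only.
rels_ext_xml = build_rels_ext("dri:obj1", "dri:col1")
```

Because the relationship is expressed as a subject, predicate and object, a third-party application that understands RDF can read it without knowing anything about the Repository's Ruby classes.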

Figure 6.2 provides an overview of the data models’ classes, which can be summarised

as follows:

Digital Object classes for the creation, management and storage of different types

of digital objects, based on their source metadata.

GenericFile classes implementing functionalities for the management of the physical

assets associated with metadata digital objects.

Metadata Datastream classes for managing XML-encoded metadata.

The data models also include a set of Ruby modules and ‘mixins’ which implement attrib-

utes and methods that are common to the different data models classes. For clarity,

these have not been included in the diagram.


49 http://www.w3.org/RDF/, last accessed 6 June 2015.

Figure 6.1: Overview of the DRI architecture with data models highlighted

[Figure labels: DRI Data Models, Solrizer, Rubydora, Active Fedora, OM]

Figure 6.2: Overview of the Data Models (class diagram)

[Class diagram. Recoverable classes, attributes and methods:

ActiveFedora::Base: pid: String; find, to_solr, save

GenericFile: gf_Attributes; dri_properties: DRI::Metadata::FileProperties; update_file_reference, milliseconds, noid_indexer

Batch (Digital Object): full_text; extracted: DRI::Metadata::Extracted; with_standard, assert_content_model, update_metadata, find_or_create, noid_indexer, object_types_to_solr

ActiveFedora::OmDatastream: dsId: String; to_solr, save

MARC (Digital Object): marc_attributes; split_xml, create_multiple_records, type

EAD (Digital Object): ead_attributes; collection_types_to_solr, object_types_to_solr, split_ead_xml, synchronise_if_changed

MODS (Digital Object): mods_attributes; split_xml, model_name, roles

QDC (Digital Object): qdc_attributes; model_name, roles

Relationship labels: isGovernedBy, isPartOf (multiplicities * and 1)]

The ways in which the data models manage different kinds of collections, based on their

source metadata, follow principles and approaches similar to other archives and libraries,

for example Atrium’s EAD content modelling for image-based collections (Johnson,

2011). Unlike the majority of repository implementations, which only support one meta-

data format, the DRI data models include support for multiple metadata standards, as

well as a set of common metadata terms. These common metadata terms are DRI spe-

cific terms, for example title, creator and date, that allow for cross-collection search as

well as visualisation tools.
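The role of the common metadata terms can be sketched as a crosswalk from each standard's own fields onto the shared terms title, creator and date. The Python example below is illustrative only: the field paths are typical locations for these values in each standard, not the mappings actually implemented in the DRI data models, and the sample records and function names are invented for the example.

```python
# Hypothetical crosswalk: common term -> typical source field in each standard.
CROSSWALK = {
    "dc": {"title": "title", "creator": "creator", "date": "date"},
    "mods": {"title": "titleInfo/title", "creator": "name/namePart",
             "date": "originInfo/dateCreated"},
    "ead": {"title": "did/unittitle", "creator": "did/origination",
            "date": "did/unitdate"},
    "marc": {"title": "245$a", "creator": "100$a", "date": "260$c"},
}


def to_common_terms(standard, record):
    """Project a source record onto the shared title/creator/date terms."""
    mapping = CROSSWALK[standard]
    return {term: record.get(field) for term, field in mapping.items()}


def search_creator(records, name):
    """Search across standards by comparing only the common 'creator' term."""
    hits = []
    for standard, record in records:
        common = to_common_terms(standard, record)
        if common["creator"] and name.lower() in common["creator"].lower():
            hits.append(common["title"])
    return hits


# Invented sample records, each flattened to {field path: value}.
RECORDS = [
    ("dc", {"title": "Harbour photograph", "creator": "Murphy, Anne", "date": "1920"}),
    ("marc", {"245$a": "Parish register", "100$a": "Murphy, Anne", "260$c": "1898"}),
    ("ead", {"did/unittitle": "Estate papers", "did/origination": "Byrne family",
             "did/unitdate": "1850-1900"}),
]
```

A search on the common creator term finds matching records whether they were catalogued in Dublin Core or MARC, which is the effect the common terms are intended to have for cross-collection search.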

Adding support for multiple metadata standards to the data models has been challeng-

ing for a number of reasons. The main challenge is the difference between the structures

of the four standards chosen. For example, DC collections present a flat structure; there

is only one collection container which holds all of the sets of metadata, data and digital

objects. In contrast, EAD collections present a complex, hierarchical structure that allows

for the definition of a large number of hierarchy levels, which requires the modelling of

additional types of objects.

In terms of data modelling, or data representation, the approach taken was based on the design approaches most commonly used within the Hydra community. For example, digital resources within collections are represented as sets of different types of digital objects: digital objects containing descriptive metadata about the resource are linked to digital objects (Generic Files) that store technical metadata and the location of any digital assets associated with the digital resource (examples of such assets include images and documents). This allows for the implementation of a set of data models for the management of “Hydra-compliant” digital objects (Duraspace, 2012). Hydra-compliant objects held in Fedora are expected, by Fedora, to have a DC metadata datastream and a RELS-EXT datastream, which declares one or more appropriate content models (cModels) for the digital object to subscribe to, as well as relationship information. Additionally, Hydra requires these objects to have an enforceable rights statement. This is held in a rightsMetadata datastream, which is currently the most common pattern, and/or in an Admin Policy Object [50] (APO) which governs the digital object. The data models implement rightsMetadata datastreams for managing digital object permissions, as well as access policies to enable any form of access to, or delivery of, the digital objects to the user. The management of access or rights metadata is implemented in a separate Ruby Gem called UserGroup, developed by the team. In addition to these required datastreams, the Repository incorporates a number of additional datastreams which vary depending on the content type of the digital object. These are described in detail in Section 6.2.


50 https://github.com/projecthydra/hydra-head/wiki/Access-Controls, last accessed 6 June 2015.

6.2. Data management: from XML to repository modelled data

The Repository currently supports three mechanisms for ingesting data and the accompanying metadata. Digital objects can be ingested through third-party applications via the DRI REST API (see Section 3.2). They can be ingested through a web-form-based ingest in the user interface. Finally, they can be ingested via the Client application which, although currently command-line based, will incorporate a graphical user interface.

Once metadata are uploaded, a number of validations are performed to ensure validity and correctness prior to ingest. As the metadata are XML-encoded, the first validation checks whether the source XML is well-formed and conforms to its associated metadata schema. For DC, MODS, and MARCXML metadata, validation against an XML Schema Definition (XSD) is performed. For EAD, either Document Type Definition (DTD) or XSD validation can be performed, depending on the format of the source metadata. Additional DRI-specific validations are also required, for example ensuring that the provided metadata include the mandatory DRI fields, and performing data-type-based checks. Temporal metadata fields, including dates or date ranges, are checked to ensure that they are encoded using the recommended standards for temporal data representation, ISO 8601 [51] or W3C w3cdtf [52].
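The validation steps described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the Repository's implementation: the list of mandatory fields, the un-namespaced element names and the method names are hypothetical, schema (XSD/DTD) validation is left out because it needs a library outside the Ruby standard distribution, and only the date-only W3CDTF forms are checked.

```ruby
require 'date'
require 'rexml/document'

# Hypothetical list of mandatory fields; the actual DRI list is not given here.
MANDATORY_FIELDS = %w[title creator date].freeze

# Date-only W3CDTF forms: YYYY, YYYY-MM or YYYY-MM-DD.
W3CDTF_DATE = /\A(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?\z/

# True when the value has a W3CDTF date-only shape and names a real calendar date.
def valid_w3cdtf_date?(value)
  match = W3CDTF_DATE.match(value.to_s.strip)
  return false unless match

  year = match[1].to_i
  month = (match[2] || '1').to_i
  day = (match[3] || '1').to_i
  Date.valid_date?(year, month, day)
end

# Returns a list of error messages; an empty list means every check passed.
def validate_metadata(xml)
  begin
    doc = REXML::Document.new(xml)
  rescue REXML::ParseException => e
    return ["not well-formed: #{e.message.lines.first.to_s.strip}"]
  end
  return ['not well-formed: no root element'] if doc.root.nil?

  errors = []
  # Mandatory-field check: each field needs at least one non-empty value.
  MANDATORY_FIELDS.each do |field|
    values = REXML::XPath.match(doc, "//#{field}").map { |el| el.text.to_s.strip }
    errors << "missing mandatory field: #{field}" if values.all?(&:empty?)
  end
  # Data-type check on temporal fields.
  REXML::XPath.match(doc, '//date').each do |el|
    text = el.text.to_s.strip
    next if text.empty?

    errors << "invalid date: #{text}" unless valid_w3cdtf_date?(text)
  end
  errors
end
```

Returning a list of messages rather than raising on the first problem lets a caller report every failed check to the depositor in one pass.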

Valid data and metadata can then be processed and stored in the Repository using the

data models. The business logic of the data management layer can be divided into two

main controllers. The first controller handles the structure, storage, and retrieval of meta-

data, while the second applies the same processes to the digital assets, i.e. the data

itself.

6.2.1. Metadata management

As introduced earlier (see Section 6.1), the data models follow the design principles for

Hydra-compliant digital objects where every digital object must incorporate a minimum

set of datastreams. The data models incorporate a number of additional datastreams

for metadata management. As collections may present different internal structures (flat

versus hierarchical, depending on their associated metadata standard), the type and

number of these datastreams vary.

51 http://www.iso.org/iso/home/standards/iso8601.htm, last accessed 6 June 2015.

52 http://www.w3.org/TR/NOTE-datetime, last accessed 6 June 2015.

For example, collections encoded using a flat metadata structure, such as Dublin Core,

are represented as sets of digital objects (the metadata objects) which are governed by

‘container’ digital objects (collection/sub-collection objects). Each of the digital objects,

one per metadata record, can be associated with other generic file objects which repre-

sent physical assets in the Repository. For this kind of collection, only one type of

metadata datastream (DC descriptive metadata datastream) is required as the only

descriptive difference between metadata objects and container objects is their type.

In contrast, for collections with a hierarchical structure, such as EAD or MODS, each of the components within the hierarchy is described in the XML using different metadata terms. For example, in EAD the collection container is described using the <archdesc> element, and any components within the collection are described using the <c> element. In terms of the data models, this means that different types of datastreams are needed to store the XML that describes the different types of components, so as to distinguish collection objects from sub-collections or bottom-level objects within the hierarchy.

In Hydra and Fedora terms, each metadata record is modelled as a compound object

with two datastreams storing the XML-encoded metadata. This information is stored in

the form of XML snippets. Figure 6.3 shows the different types of datastreams for meta-

data objects. The first type of datastream, descMetadata (descriptive metadata), stores

metadata at the object level. For digital objects in a collection with a hierarchical struc-

ture, only the metadata associated with an object is stored in this datastream. Metadata

objects within the hierarchy also store, in the second type of datastream (fullMetadata),

a more in-context XML snippet with information about the object’s immediate children.

Storing metadata information with such a low level of granularity allows for indexing a

collection’s individual parts into Solr so they can be independently discovered and visu-

alised. At the same time, they contain sufficient contextual information so they can refer

back to other relevant objects within the collection.
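As an illustration of the two kinds of snippet described above, the following sketch cuts an EAD-like hierarchy into a per-component snippet (the kind of content descMetadata would hold) and an in-context snippet that keeps only the immediate children (the kind of content fullMetadata would hold). It is an assumption about how such snippets could be produced, not the Repository's code: the method name is hypothetical and the input is simplified, un-namespaced EAD that uses only <c> elements.

```ruby
require 'rexml/document'

# For every <c> component, build:
#   :desc  the component on its own (child components removed)
#   :full  the component with its immediate children (grandchildren removed)
def component_snippets(ead_xml)
  doc = REXML::Document.new(ead_xml)
  REXML::XPath.match(doc, '//c').map do |component|
    own = component.deep_clone
    own.elements.delete_all('c') # drop every child component

    context = component.deep_clone
    context.elements.each('c') do |child|
      child.elements.delete_all('c') # keep children, drop their own children
    end

    { id: component.attributes['id'], desc: own.to_s, full: context.to_s }
  end
end
```

Each resulting hash could then be indexed independently, while the :full snippet retains enough context to link a component to the objects directly beneath it.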

Figure 6.3: Class diagram showing metadata digital object and its associated datastreams

[Diagram content, as far as the labels can be recovered: Digital Object (object_attributes, relationships, indexing_methods, relationship_processing, metadata_processing) includes the InterchangeableMetadata <<Mixin>> (has_metadata, dri_terms_definition, dri_relationships, metadata_initializers, loading_helpers). Associated datastreams: properties (metadata_terminology, is_collection?), descMetadata (metadata_terminology, metadata_indexing, metadata_validations) and fullMetadata.]

The information stored in the metadata datastreams is managed by Ruby classes which extend ActiveFedora::OmDatastream. OmDatastream is a Ruby class built on the Opinionated Metadata (OM) gem [53] (see Section 6.1). Using this class, the DRI data models can easily manage Fedora XML datastreams, since it allows Ruby attribute accessors (with XPath support) to be assigned to specific XML elements. In this way, metadata information can be managed without having to access the source XML directly. ActiveFedora also provides attribute accessors to datastreams extending from OmDatastream, so that objects at higher levels within the data model have direct access to the metadata terms stored in datastreams. For example, the data model’s ‘title’ accessor provides access to the information contained in the metadata ‘title’ term. Calling this accessor invokes the title accessor defined in OmDatastream, which in turn reads the source XML where the title information is actually stored. Another advantage of OmDatastream is its native support for Solr indexing. ActiveFedora includes mechanisms to convert between the standard Ruby hash data structure and a Solr document data structure, i.e. the ActiveFedora model implements methods that transform metadata information into a data structure in which the data are stored as key-value pairs, with each key being the name of a Solr field, and each value holding the data to be indexed.
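The accessor chain described above can be imitated with the Ruby standard library alone. The sketch below is a stand-in written for illustration, not ActiveFedora or OM code: the class names, the terminology and the un-namespaced XML are hypothetical, and the `_tesim` field suffix is assumed to follow Solrizer's naming convention for stored, searchable text.

```ruby
require 'rexml/document'

# Stand-in for an OM-style datastream: each term is mapped to an XPath,
# and an accessor of the same name reads the values from the source XML.
class SketchDatastream
  TERMS = { title: '//title', creator: '//creator' }.freeze # hypothetical terminology

  def initialize(xml)
    @doc = REXML::Document.new(xml)
  end

  TERMS.each do |term, xpath|
    define_method(term) do
      REXML::XPath.match(@doc, xpath).map { |el| el.text.to_s }
    end
  end

  # Builds the key-value structure in which each key names a Solr field.
  def to_solr(solr_doc = {})
    TERMS.each_key { |term| solr_doc["#{term}_tesim"] = public_send(term) }
    solr_doc
  end
end

# Stand-in for a digital object whose accessor delegates to its datastream.
class SketchObject
  def initialize(desc_metadata)
    @desc_metadata = desc_metadata
  end

  def title
    @desc_metadata.title
  end
end
```

Calling `title` on the object never touches XML directly; the XPath lives in one place, so a change to the metadata layout only affects the terminology table.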

6.2.2. Digital assets management

Uploaded preservation-quality assets undergo a rigorous set of checks before they are

permanently stored in the Repository. As set out by requirements, incoming files are

scanned for malware and validated for file type correctness. Failure of the malware check

forces the Repository to reject the uploaded files, while other tests deliver error messages

to the user attempting the upload. If all tests are passed, the assets are preserved in the

internal storage system (see Section 5.2).
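A minimal sketch of this check sequence follows. It assumes, for illustration only, that the malware scanner is supplied as a callable and that file type correctness is judged by comparing the file extension with the leading bytes of the content; the names, the exception class and the short signature table are hypothetical and do not describe the tools the Repository actually uses.

```ruby
# Leading bytes ("magic numbers") for a few common formats.
SIGNATURES = {
  '.pdf' => '%PDF-'.b,
  '.png' => "\x89PNG\r\n\x1A\n".b,
  '.jpg' => "\xFF\xD8\xFF".b
}.freeze

# Raised when the malware check fails: the upload is rejected outright.
class MalwareDetected < StandardError; end

# scanner is any callable returning true for infected content
# (a stand-in for a real virus scanner).
# Returns a list of error messages for the non-fatal checks.
def check_asset(filename, content, scanner:)
  raise MalwareDetected, "#{filename} rejected" if scanner.call(content)

  errors = []
  ext = File.extname(filename).downcase
  signature = SIGNATURES[ext]
  if signature.nil?
    errors << "unsupported extension: #{ext}"
  elsif !content.b.start_with?(signature)
    errors << "content does not match #{ext} file type"
  end
  errors
end
```

The two failure modes are kept distinct on purpose: a malware hit aborts processing with an exception, whereas a file type mismatch is returned as a message that can be shown to the user attempting the upload.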

Similar to how metadata digital objects are treated, a digital asset is modelled as a class

extending from ActiveFedora::Base (i.e. DRI::GenericFile introduced above). Generic files

are digital objects with content-bearing datastreams to represent the physical contents

of the digital asset, as well as a datastream that contains descriptive information (tech-

nical metadata) about the asset. Figure 6.4 shows the classes that have been

implemented to represent Generic Files.

In terms of modelling this kind of digital object, Hydra models typically follow one of two approaches (“Design principles,” n.d.). The first is based on the definition of ‘compound’ content objects with a number of content-bearing datastreams, whilst the second defines ‘atomistic’ or complex objects, where a parent object is linked to content held in one or more child objects. Compound objects that have just one content-bearing datastream are usually referred to as ‘simple’ content objects. Compound objects can be useful in terms of dissemination, as one digital object can store preservation-quality assets as well as lower-resolution assets for internet delivery. However, the DRI data models implement ‘simple’ content objects, with the preservation-quality asset represented in the content-bearing datastream. Lower-resolution files can then be generated for dissemination via the Repository’s user interface. The rationale is that this results in less complex digital objects, in which preserved materials are kept separate from purposively generated ones which, if needed, can be updated and/or regenerated.

53 https://github.com/projecthydra/om, last accessed 6 June 2015.

Figure 6.4: Class diagram of the Generic File for digital asset representation

Content digital objects are stored in the digital repository once they have passed all the

system tests and Fedora digital objects are then generated. A one-to-many association

links the Generic File model used to represent assets with its associated metadata digital

object. This creates a simple child-to-parent data model association (isPartOf) that links

many assets to one metadata digital object.

[Figure 6.4 content, as far as the labels can be recovered: Generic File (:belongs_to DRI::Batch; attributes; update_file_reference, milliseconds, noid_indexer) with DRI Properties <<DRI::Metadata::FileProperties>> (metadata_terminology).]

54 https://github.com/resque/resque, last accessed 6 June 2015.

Once the asset is saved in Fedora, it is added to the Repository’s surrogate Resque [54] queue, where uploaded preservation-quality media files are converted into lower-quality surrogate files intended for access and discovery. All surrogates are generated automatically by background processes running on the servers. Maintaining a centralised system for generating surrogates is resource intensive but ensures full control of surrogate generation. Furthermore, it eases the adoption of future internet technologies: the DRI can update the parameters needed to create new types of surrogates, and can meet future end users’ expectations of how these media files are displayed.
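The background-queue pattern described above can be sketched as follows. A Resque job is an ordinary Ruby class that names its queue in a class-level `@queue` variable and exposes a `perform` class method; here a small in-process queue stands in for Resque and its Redis backend so that the example runs on its own. The job name, queue name and surrogate widths are hypothetical, and the job only returns the names of the surrogates it would create rather than converting any media.

```ruby
# In-process stand-in for a Resque queue (the real one is backed by Redis).
class SketchQueue
  def initialize
    @jobs = Hash.new { |hash, key| hash[key] = [] }
  end

  # Mirrors the shape of Resque.enqueue(JobClass, *args).
  def enqueue(job_class, *args)
    queue_name = job_class.instance_variable_get(:@queue)
    @jobs[queue_name] << [job_class, args]
  end

  # What a background worker does: take jobs off the queue and perform them.
  def work_off(queue_name)
    results = []
    until @jobs[queue_name].empty?
      job_class, args = @jobs[queue_name].shift
      results << job_class.perform(*args)
    end
    results
  end
end

class CreateSurrogateJob
  @queue = :surrogates # hypothetical queue name

  # Hypothetical surrogate settings, kept in one place so that new
  # surrogate types can be added by changing parameters only.
  SURROGATE_WIDTHS = { thumbnail: 200, medium: 800 }.freeze

  def self.perform(asset_id)
    SURROGATE_WIDTHS.map { |name, width| "#{asset_id}_#{name}_#{width}px" }
  end
end
```

Because the job receives only an identifier, the arguments stay small and serialisable, and the surrogates can be regenerated at any time by enqueuing the same identifier again.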

Within the Repository, the preserved assets are treated in a similar fashion to preserved metadata records. The goal is to extract as much descriptive information as possible from asset contents while, at the same time, making this information available to users. The Repository extracts machine-readable file information such as the file’s dimensional characteristics, statistical information and metadata automatically generated by source tools. Currently, the Repository uses Harvard’s FITS (Harvard, 2009) tool to extract these metadata fields, storing the results as a FITS XML file. As well as the metadata fields, the Repository also extracts full text from textual documents, which allows users to perform complex searches on a document’s full text. Apache Solr provides built-in functionality for full-text indexing and supports a wide variety of document file types. For this, Solr Cell is used, as it facilitates indexing all document types supported by the Apache Tika [55] project.

6.3. Data indexing, search and retrieval

The Repository’s data models handle indexing into Solr, which is performed using

ActiveFedora’s built-in indexing support, through the Solrizer Ruby gem. Both digital

objects and datastreams extending from ActiveFedora have access to Solrizer indexing

methods, specifically, to the to_solr method. Invoking this method from a digital object’s

datastream triggers the transformation of metadata contents into a Solr document. This

is an XML document in Solr’s expected index format. Solrizer then commits the index

document into the Solr instance. For digital objects, the process is repeated for each of

the datastreams of the digital object. Saving digital objects triggers to_solr calls, which

in turn send the contents to be indexed into Solr, thus allowing for automatic updates.

The process of indexing is illustrated in Figure 6.5.
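The save-triggered indexing loop can be sketched in a few lines of plain Ruby. The classes below are stand-ins written for illustration, not ActiveFedora or Solrizer code: an in-memory hash plays the part of the Solr instance, and each datastream contributes its fields to a shared Solr document.

```ruby
# Stand-in for the Solr instance: documents are kept in memory by id.
class SketchIndex
  attr_reader :documents

  def initialize
    @documents = {}
  end

  def add(solr_doc)
    @documents[solr_doc['id']] = solr_doc
  end
end

# Stand-in for a datastream: to_solr adds its fields to the document passed in.
SketchFieldStream = Struct.new(:fields) do
  def to_solr(solr_doc = {})
    solr_doc.merge(fields)
  end
end

class SketchDigitalObject
  def initialize(id, datastreams, index)
    @id = id
    @datastreams = datastreams
    @index = index
  end

  # The loop: every datastream contributes to a single Solr document.
  def to_solr
    solr_doc = { 'id' => @id }
    @datastreams.each { |datastream| solr_doc = datastream.to_solr(solr_doc) }
    solr_doc
  end

  # Saving triggers indexing (persistence to Fedora is omitted here).
  def save
    @index.add(to_solr)
    true
  end
end
```

Tying the index update to `save` means the index cannot silently drift from the stored object, at the cost of one index write per save.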

55 http://tika.apache.org/, last accessed 6 June 2015.

Figure 6.5: Sequence diagram showing the process of indexing into Solr

[Diagram content, as far as the labels can be recovered: frame “sd Data Model Indexing into Solr”; participants :ActiveFedora::Base <<Digital Object>>, :ActiveFedora::OmDatastream <<Metadata>>, :Solrizer, :Solrizer::Fedora::Indexer and :Tomcat Server::Solr; messages to_solr, loop [0, datastreams.count - 1] with to_solr(solr_doc), solr_doc and merge(solr_doc), then to_solr, index(solr_doc), update(solr_document), commit, response, response.]

The data management layer uses indexing to present the data it stores in meaningful ways to the end user. The Repository’s user interface provides the user with searching and faceting or filtering functionalities, with the support of the Apache Solr search engine. This allows the user to focus their search results from the large number of available datasets, as well as to explore the contents of collections in a more user-friendly and interactive way. Faceting breaks up search results into multiple categories and allows the user to further restrict search results, based on the facets selected.
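As a sketch of how a keyword search with facets translates into a Solr request, the following builds the query string from Solr's standard `q`, `facet`, `facet.field` and `fq` parameters. The method name and the field names used in the usage note are hypothetical, and escaping of quotation marks inside facet values is not handled.

```ruby
require 'uri'

# Builds the query string for a faceted Solr search.
#   keywords      the user's search text
#   facet_fields  fields for which Solr should return category counts
#   selected      facet values already chosen, used to restrict the results
def facet_query(keywords, facet_fields:, selected: {})
  params = [['q', keywords], ['facet', 'true']]
  facet_fields.each { |field| params << ['facet.field', field] }
  selected.each { |field, value| params << ['fq', %(#{field}:"#{value}")] }
  URI.encode_www_form(params)
end
```

For example, `facet_query('letters', facet_fields: ['subject_sim'], selected: { 'subject_sim' => 'Easter Rising' })` asks for counts per subject while restricting the results to one subject. Each selected facet becomes its own filter query, so filters can be added or removed without altering the keyword query.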

The source metadata format used does not restrict indexing into Solr as a result of the

data models’ transformations performed on ingest. One such transformation would be

the processing of personal names. For example, searching for “Eamon De Valera” as an

author will return objects that store any version of that name in the <origination> (EAD)

and <creator> (Dublin Core) metadata fields.
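The kind of transformation involved can be illustrated with a small sketch. The Repository's actual name processing is not described here, so the rules below (re-ordering “Surname, Forename”, removing diacritics, lower-casing) and the method names are assumptions chosen only to show how differently encoded names from different source fields can end up under one search key.

```ruby
# Reduces different spellings of a personal name to one comparable key.
def name_key(name)
  # "Surname, Forename" becomes "Forename Surname".
  surname, forename = name.to_s.split(',', 2).map(&:strip)
  ordered = forename ? "#{forename} #{surname}" : surname.to_s
  # Decompose accented letters, then drop the combining marks.
  plain = ordered.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  plain.downcase.gsub(/[^a-z ]/, '').squeeze(' ').strip
end

# Groups record ids by normalised name, whatever the source field was.
def index_names(records)
  records.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |record, index|
    index[name_key(record[:name])] << record[:id]
  end
end
```

With such a key, a record whose EAD <origination> reads “De Valera, Eamon” and a record whose Dublin Core <creator> reads “Éamon de Valera” are found by the same search.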

Similar processes also apply to the presentation of temporal data. An example of this is

how the data models manage date ranges stored as DCMI periods: e.g. the value “1721-

1730”, stored in the <date> field. Rather than simply storing it as two dates, as many

systems would automatically do, thereby missing the meaning of the range as under-

stood by a human reader, the data models make use of Solr indices that are specifically

designed to handle dates and date ranges. When searching the Repository this allows

for the return of any digital object that falls within the specified date range.
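The difference between storing a range as two unrelated dates and treating it as a period can be shown with a short sketch. In Solr this work would be done by a range-aware field type (DateRangeField is one such type in recent Solr versions; which field type the Repository uses is not stated here). The Ruby below only imitates the overlap test, at year precision, with hypothetical method names.

```ruby
# Parses "1721-1730" (or a single year such as "1798") into a Range of years.
# Returns nil when the value is not in one of those two forms.
def parse_period(value)
  if (match = /\A\s*(\d{4})\s*-\s*(\d{4})\s*\z/.match(value.to_s))
    first = match[1].to_i
    last = match[2].to_i
  elsif (match = /\A\s*(\d{4})\s*\z/.match(value.to_s))
    first = last = match[1].to_i
  else
    return nil
  end
  first <= last ? (first..last) : nil
end

# Two periods overlap when each starts before the other ends.
def overlaps?(period, query)
  period.begin <= query.end && query.begin <= period.end
end

# records maps an object id to its date value; returns the matching ids.
def search_by_period(records, query)
  matches = records.select do |_id, value|
    period = parse_period(value)
    period && overlaps?(period, query)
  end
  matches.keys
end
```

A search for 1725 to 1750 therefore returns an object dated “1721-1730” even though neither 1721 nor 1730 is itself the year searched for.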

The rationale for such transformations to be applied to the original, source metadata is

to enhance how data is presented to the end user. The Repository’s indexing and faceting

allows the linking of data through meaning; the information is presented and searched

in a semantic way, rather than treating it as plain text. This indexing information is avail-

able to third party tools through the API.

It is also possible for third party developers to create their own data model to interpret

the Repository’s stored data in order to expand on this semantic interpretation, or to

present to the end user the stored data in a different format.

6.4. Relationship representation in the data models

In addition to data indexing, the ability to represent and store data relationships can enhance data visualisation and also provide the end user with richer contextual information. This is particularly important when exploring a vast number of collections. The data models incorporate data relationship management when representing collections in the Repository. Figure 6.6 shows how relationships can be used to represent the structure of a collection stored in the digital repository. The set of supported relationships was selected based on the needs of the individual demonstrator projects, as well as on their appropriateness for describing the kinds of collections that will be held in the Repository. The relationships supported by the data models can be grouped into two main categories:


Data models internal relationships, which are mostly used for representing collec-

tion structure and membership

Metadata-defined relationships, which are used for representing associations

between digital objects

The first type of relationship is common to every object, independent of its metadata type, and can be used to express navigation or hierarchy. Examples of this type of relationship include isGovernedBy, isMemberOfCollection/hasMember, and isPartOf/hasPart. The second type of relationship is provided via the descriptive metadata associated with digital objects. The data models implement a different set of relationships for each supported metadata standard, as the type and number of relationships vary from one standard to another. Examples of this type of relationship include isPrecededBy/isSucceededBy, isDocumentationFor, and isReferencedBy.

Figure 6.6: Example of relationship representation for DRI collections

[Diagram content, as far as the labels can be recovered: a Collection Container, two Sub-collections and two Metadata Objects (all <<DRI::Batch>>) and two Digital Assets (<<DRI::GenericFile>>), connected by relationships labelled isGovernedBy, isPartOf and isMemberofCollection.]

Table 6.1: Supported relationships in DRI

Metadata Standard  Relationship Name               Relationship            Note
DC                 Relation                        dcterms:relation        Link to internal DRI objects, or external digital materials
DC                 Is Part Of                      dcterms:isPartOf        Collection structure/hierarchy representation
DC                 Is Referenced By                dcterms:isReferencedBy
DC                 References                      dcterms:references
DC                 Is Version Of                   dcterms:isVersionOf
DC                 Has Version                     dcterms:hasVersion
DC                 Is Format Of                    dcterms:isFormatOf
MODS               Preceding                       relatedPreceding        Expression of sequencing
MODS               Succeeding                      relatedSucceeding
MODS               Host                            relatedHost
MODS               Constituent                     relatedConstituent
MODS               Original                        relatedOriginal
MODS               Other Version                   relatedVersion
MODS               Review Of                       relatedReview
MODS               Other Format                    relatedFormat
MODS               Is Referenced By                relatedReferencedBy
MODS               References                      relatedReferences
MODS               Series                          relatedSeries
MARC               Other Edition Entry             MARC Tag: 775
MARC               Additional Physical Form Entry  MARC Tag: 776
MARC               Preceding Entry                 MARC Tag: 780
MARC               Succeeding Entry                MARC Tag: 785
MARC               Other Relationship Entry        MARC Tag: 787


Table 6.1 summarises the relationships that have been implemented in the data models,

for each of the supported metadata standards.

From a technical perspective, relationships are implemented in the data models by using ActiveFedora’s built-in support for relationship definition. Relationships are represented as Rails associations between objects (Coyne, 2013), and the associations used are of the types ‘has_many’, a one-to-many relationship, and ‘belongs_to’, its many-to-one inverse, in which each object refers to a single parent. Additionally, ActiveFedora uses RDF to represent these relationships. Whenever an association between two objects is created, ActiveFedora writes this metadata into the RELS-EXT datastream of the related digital objects. The sample code below shows how ActiveFedora relationships are defined in Ruby, as well as their translation into RDF assertions in the RELS-EXT datastream.

Figure 6.7: ActiveFedora relationships as defined in Ruby

class Batch < ActiveFedora::Base
  # One-to-many side: a batch has many parts that point back at it via isPartOf.
  has_many :parts, :property => :is_part_of, :class_name => 'DRI::Batch'
  # Inverse side: each batch refers to a single governing batch.
  belongs_to :is_governed_by, :property => :is_part_of, :class_name => 'DRI::Batch'
  …
end
--------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ns0="info:fedora/fedora-system:def/model#"
         xmlns:ns1="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/pid:2">
    <ns0:hasModel rdf:resource="info:fedora/afmodel:DRI_Batch"/>
    <ns1:isPartOf rdf:resource="info:fedora/pid:1"/>
  </rdf:Description>
</rdf:RDF>

The hasModel assertion shown above is created automatically from the name of the ActiveFedora model. The remaining relationships take their names from a configuration file (predicate_mappings.yml) that ActiveFedora uses to look up RDF predicates, based on the property name specified in the data model. The data models override this file to include a custom set of specific relationships, as well as the set of relationships implemented for each of the supported metadata standards.
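The lookup from a property name to an RDF predicate can be sketched as follows. The hash below is a simplified stand-in for the contents of predicate_mappings.yml (the layout of the real file is not reproduced here), the method name is hypothetical, and the output is written as an N-Triples-style line rather than the RDF/XML that Fedora stores.

```ruby
RELS_EXT_NS = 'info:fedora/fedora-system:def/relations-external#'

# Simplified stand-in for the predicate mapping: property name => predicate.
PREDICATES = {
  is_part_of: 'isPartOf',
  has_part: 'hasPart',
  is_member_of_collection: 'isMemberOfCollection'
}.freeze

# Builds the assertion that would be recorded for one relationship.
def rels_ext_assertion(subject_pid, property, object_pid)
  predicate = PREDICATES.fetch(property) do
    raise ArgumentError, "no predicate mapped for #{property}"
  end
  "<info:fedora/#{subject_pid}> <#{RELS_EXT_NS}#{predicate}> <info:fedora/#{object_pid}> ."
end
```

Keeping the mapping in a single table is what makes it possible to extend the set of relationships, for example with those of a further metadata standard, without touching the model classes.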

Currently the data models process relationships only if the information is included in the source metadata, but in future they will also support adding relationships via the user interface. Support for adding relational information through the user interface is important because some of the supported metadata standards, for example MARC, limit how relationship information can be specified in the metadata.


7. User Interface Design

As part of the overall requirements gathering process (see Chapter 2) and to provide a

brief for later visual design phases, an evaluation of information access use cases was

carried out. This involved examining real-world digital libraries, repositories and other

non-academic digital asset management sites. Each was considered from the point of

view of functionality, usability and accessibility.

Following the evaluation of user interfaces a number of recommendations were made

regarding the architecture of the user interface (UI). The Repository should:

Make search integral to the interface.

Support faceted browsing capabilities across collections and broad categories.

Develop the interfaces within a well-supported, Open Source, framework and utilise

well-supported standards based visualisation tools.

Provide object usage and geolocation visualisations.

Provide timeline visualisation.

7.1. Interface design principles

It is important to note that a decision was made in the proposal stages of the project to support only HTML5 [56] compliant web browsers. HTML5 has been the W3C’s candidate recommendation for a fully interoperable HTML standard since July 2013. This has allowed the development team to concentrate on modern techniques and supporting software frameworks without the need to consider support for legacy browsers.

As discussed by Johnson (2010) the use of grid-based design systems predates the Web.

In Johnson’s view grid designs are simply adhering to the universal design principle of

alignment. Grid designs simplify the information retrieval task for users by making the

display readily understandable. The more rapidly users can identify a pattern to the infor-

mation the more quickly they can move on to information analysis.

With these design principles in mind the DRI desktop web layout is based on a four

column 1024px fixed grid. This layout allows for a balance between a predictable

desktop layout and an uncomplicated reconfiguration of the display for mobile devices

based on a 230px width.


56 http://www.w3.org/TR/html5/, last accessed 6 June 2015.

Figure 7.1: DRI desktop fixed grid concept design

CSS Media queries were introduced in the W3C CSS3 recommended standard in June

2012. They allow the browser to select a CSS style based on the capabilities of the device

displaying it. The most important of these queries is the width of the browser. The

designs also make use of semantic layout, including the use of meaningful container

elements such as header, article, aside, nav and footer to aid accessibility.

Adaptive and responsive web design is a relatively recent web design approach. This

approach acknowledges the range of device and use contexts which web sites must

now provide for. New device contexts such as tablet computers and smartphones now

account for a substantial proportion of all web site usage (Google, 2015) and the

Repository is expected to be accessed across all device types.

There are two similar but subtly different approaches to dealing with device capabilities in CSS. The first, responsive design, allows the page layout to change incrementally across a range of screen sizes using fluid, percentage-based CSS markup. Adaptive design, by contrast, sets ‘breakpoint’ values for screen size that change the layout (sometimes dramatically) when, for example, a browser screen below a certain width is detected. The overall aim of both approaches is to provide the user with an experience which is optimal for their use context.

The design takes an adaptive approach that allows the desktop media rules to present a known layout of fixed width. The tablet and smartphone use contexts are given separate CSS while making use of the same HTML markup. This overall approach makes the redesign process relatively straightforward and provides for a very clean separation of content and presentation.


Figure 7.2: Designs for tablet

Figure 7.3: Design for search result page – desktop version

7.2. Public and discovery interfaces

Digital repositories can be used for many reasons, but arguably the most important set of use cases centres on accessing information. There are also many content-specific and goal-directed requirements that might lead a user to access a collection of digital information (Saracevic, 2000).

The designs have adopted the ‘search everywhere’ pattern as suggested by Hearst and

others (Hearst 2009). It is intended that the user should be able to trigger a content

search using a text string and that this search form should be available on all pages. The

initial search trigger is a keyword or keyword-wildcard ‘*’ search. This brings the user

into the search result page where teaser views of digital objects are displayed. After trig-

gering a keyword search the user is able to make use of the facet filtering to narrow the

search. The Repository makes use of the Blacklight software component (see Section

4.3) to facilitate both the keyword search and facet filtering.

7.3. UI architecture

The user interface designs are realised using several technologies that are outlined below.

7.3.1. Ruby on Rails views

As the Hydra platform was selected as the framework architecture for the Repository, the implementation of its user interfaces utilises the Model-View-Controller pattern of its web framework, Ruby on Rails (see Section 4.3). A view in the default configuration of Ruby on Rails is a type of template file (erb file) which mixes HTML markup with dynamically generated data from the application and renders the final webpage. A single view may also be split into several separate erb files or view “partials”. Typically these partials correspond to some functional part of the page, such as the main content or navigation. It is by manipulating these view partials that the overall HTML rendering is achieved. Non-HTML views can also be created to allow rendering in different formats, such as XML or JSON [56].
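The idea of views, partials and non-HTML renderings can be illustrated outside Rails with the ERB and JSON libraries that ship with Ruby. This is a sketch of the templating mechanism only; Rails adds file-based templates, layouts and a `render` helper on top of it, and the template text and method names below are hypothetical.

```ruby
require 'erb'
require 'json'

# A "partial": renders one search result as a list item.
RESULT_PARTIAL = ERB.new('<li><%= ERB::Util.html_escape(item[:title]) %></li>')

# The enclosing view: a heading plus the already-rendered partials.
PAGE_VIEW = ERB.new("<h1><%= ERB::Util.html_escape(heading) %></h1>\n<ul>\n<%= items %>\n</ul>\n")

# HTML rendering: each result goes through the partial, then into the page.
def render_html(heading, results)
  items = results.map { |item| RESULT_PARTIAL.result_with_hash(item: item) }.join("\n")
  PAGE_VIEW.result_with_hash(heading: heading, items: items)
end

# Non-HTML rendering of the same data.
def render_json(heading, results)
  JSON.generate(heading: heading, results: results)
end
```

Both renderings start from the same data, which is what allows one controller action to serve a web page to a browser and JSON to a script.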

7.3.2. JavaScript framework

The JavaScript framework used by the Repository is jQuery. It provides libraries and functions for common JavaScript tasks such as carrying out AJAX requests and performing UI animation. It has been the standard Ruby on Rails framework since version 3 and is commonly used by components such as Blacklight. Since the UI team was already familiar with jQuery, it was decided to continue development using this framework, which is arguably becoming the de facto standard (W3Techs, 2014).

56 JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language. http://www.json.org/, last accessed 6 June 2015.

7.3.3. SASS and Grid framework

The complex adaptive layouts envisaged for the Repository make the task of producing CSS quite difficult. To aid in the task, the Repository makes use of the SASS language [57] and a grid framework called Zen Grids [58].

SASS is an extension of the CSS3 language that adds programmatic elements from other

languages such as nested rules, variables, mixins (similar to functions, procedures, or

methods in other languages) and selector inheritance. SASS is not currently understood

by web browsers natively but is compiled into well-formatted CSS3 using either a

command line tool or a web-framework plugin, available as standard in Rails 3.

The Zen Grids Framework is a set of SASS mixins that provide tools for making responsive or adaptive grid-based layouts. This allows the grid layouts to be more easily specified and separated from the design of the block-level elements. It has the added advantage that block-level designs can be treated almost independently of page layout.

7.4. Visualisations

Three types of data visualisations are being supported within the application. These are

full-text context visualisation, also known as ‘hit highlighting’, search result timelines

and search result mapping.

7.4.1. Timelines

Timelines are supported by rendering a JSON-formatted search result within a JavaScript HTML5 timeline ‘widget’. The framework currently being used is called TimelineJS. TimelineJS is an open-source tool that renders the JSON as an HTML5-compatible interactive timeline [59].
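A sketch of the server-side step, turning search results into timeline JSON, is given below. The JSON layout (a top-level timeline object with headline, type and a date list whose entries carry startDate, headline and text) is modelled on the TimelineJS format and should be treated as an assumption to check against the TimelineJS version in use; the method names and the shape of the search results are hypothetical.

```ruby
require 'date'
require 'json'

# Converts search results into timeline events, skipping results whose
# date cannot be parsed as an ISO 8601 calendar date.
def timeline_events(results)
  results.each_with_object([]) do |result, events|
    begin
      date = Date.iso8601(result[:date].to_s)
    rescue ArgumentError
      next
    end
    events << {
      'startDate' => date.strftime('%Y,%m,%d'),
      'headline' => result[:title],
      'text' => result[:description].to_s
    }
  end
end

# Wraps the events in the enclosing timeline object and serialises it.
def timeline_json(headline, results)
  JSON.generate('timeline' => {
                  'headline' => headline,
                  'type' => 'default',
                  'date' => timeline_events(results)
                })
end
```

Results without a usable date are left out rather than placed at an arbitrary point, so the timeline only shows objects that can be positioned reliably.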

57 http://sass-lang.com/, last accessed 6 June 2015.

58 http://zengrids.com/, last accessed 6 June 2015.

59 https://github.com/NUKnightLab/TimelineJS, last accessed 6 June 2015.

Figure 7.4: Timeline visualisation

7.4.2. Mapping

Mapping of search results is also supported by rendering a JSON-formatted result within another JavaScript framework, currently OpenLayers60. OpenLayers is a web mapping library which takes geographic data and, using HTML5 techniques, renders it onto a map. Map tiles can be downloaded from a variety of open source, free and commercial sources, which gives the Repository good flexibility into the future.
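To make the idea of a JSON-formatted map result concrete, the sketch below turns search results that carry coordinates into a GeoJSON FeatureCollection, a format that OpenLayers is able to read. This is an assumption for illustration only: the text does not say which JSON layout the Repository passes to OpenLayers. The function name, the input field names (`title`, `url`, `lat`, `lon`) and the sample records are hypothetical.

```python
import json


def build_geojson(results: list[dict]) -> dict:
    """Convert search results into a GeoJSON FeatureCollection of points.

    Results without both a latitude and a longitude are skipped.
    """
    features = []
    for result in results:
        lat, lon = result.get("lat"), result.get("lon")
        if lat is None or lon is None:
            continue
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"title": result["title"], "url": result.get("url")},
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    # Hypothetical search results, not real Repository records.
    sample = [
        {"title": "Hypothetical object A", "lat": 53.35, "lon": -6.26},
        {"title": "Hypothetical object B (no location)"},
    ]
    print(json.dumps(build_geojson(sample), indent=2))
```

Because the map tiles are fetched separately from the search results, a structure like this stays the same whichever tile source is chosen.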

Figure 7.5: Map visualisation


60 http://openlayers.org/en/v3.3.0/doc/, last accessed 6 June 2015.

8. Conclusion

As discussed, the development of the Repository was based on a rigorous requirements gathering and management process. This process informed the overall development of the Repository by Strand 3, with input from the Work Packages, Task-forces and Working Group across Strands 1, 2 and 4. The development of the Repository will continue in the future; work on a Trusted Digital Repository is never complete. A number of further features are planned, including automated aggregation of (meta)data from partners, the expansion of preservation activities, support for additional metadata standards and a browser-based bulk ingest tool. Over time hardware failures may occur, file formats may become obsolete, user access devices and styles will change, and new tools and methods will be developed. The DRI Repository must be able to respond to these new requirements over time and ensure the preservation of, and sustained access to, the data held.




Appendix 1: Cucumber Feature Example

Scenario: Bulk Ingest of a directory of 10 assets and metadata.xml files
  Given I am logged in as "user1"
  And there is an existing collection
  And I have depositor permissions for the collection
  And there is a valid "bulk ingest" location
  And the "Digital Assets" are valid
  And the metadata are valid
  When I select the collection
  And I specify the "bulk ingest" location
  And I run the "bulk ingest" command-line tool
  Then the digital objects should be created in the collection
  And I should get a "report"
  And the report should contain a list of "PIDs"
  And the report should contain a list of "URLs"
  And the report should contain a list of "checksums"

Appendix 2: Controllers Class Diagram

Figure: class diagram of the application controllers. The methods of each controller are listed below in the groups shown in the diagram.

CollectionsController
  create, destroy, edit, new, publish, review, update
  _layout, create_from_form, create_from_xml, create_reader_group, delete_collection, publish_collection, reader_group_name, valid_permissions?

MetadataController
  show, update
  _layout

SessionController
  create
  _layout

DatastreamVersionController
  show
  _layout

CatalogController
  enforce_search_for_show_permissions, exclude_unwanted_models, rows_per_page, solr_search_params_logic, solr_search_params_logic=, solr_search_params_logic?
  _layout

ObjectHistoryController
  show
  _layout

ObjectsController
  citation, create, edit, index, new, related, show, status, update
  _layout

UserReportController
  index
  _layout

ApplicationController
  _logging_in_user_callbacks, _logging_in_user_callbacks=, _logging_in_user_callbacks?, after_sign_out_path_for, duplicates?, retrieve_object, retrieve_object!, set_access_permissions, set_cookie, set_locale, solr_access_filters_logic, solr_access_filters_logic=, solr_access_filters_logic?, supported_licences
  _layout, authenticate_user_from_token!, duplicates

InstitutesController
  associate, associate_depositing, create, new, show
  _layout

TimelineController
  get
  _layout, cover_image, create_timeline_data, default_image, get_cover_image, get_query, parse_dcmi, parse_dcmi?, search_image, surrogate_url

ExportController
  show, solr_search_params_logic, solr_search_params_logic=, solr_search_params_logic?
  _layout

LicencesController
  create, edit, index, new, show, update
  _layout

PagesController
  _layout_from_proc
  _layout

WorkspaceController
  enforce_search_for_show_permissions, exclude_unwanted_models, rows_per_page, solr_search_params_logic, solr_search_params_logic=, solr_search_params_logic?
  _layout

MapsController
  get
  _layout, create_maps_data, get_query, parse_dcmi, parse_dcmi?

SavedSearchesController
  clear, forget, index, save
  verify_user
  _layout

SurrogatesController
  download, show, update
  _layout, generate_surrogates, solr_query, surrogates

ErrorController
  error_404, error_422, error_500
  _layout

AssetsController
  create, download, list_assets, local_storage_dir, show, update
  _layout, create_file, save_file, upload_from_params, validate_upload

HighVoltage::PagesController

