Neil Jefferies
Bodleian Libraries
Structure Analysis
Deconstructing Digital Libraries
Contextual Data Models
Two projects
Linked Data for Libraries (LD4L)
CAMELOT
Conclusions
Neil Jefferies, Bodleian Libraries, University of Oxford
Deconstructing digital libraries What are the salient features of the content?
Tablets? Scrolls? Books? Journals? Images? Datasets?
Ideas and the development of ideas Data vs Facts vs Information vs Knowledge The artefacts of “intellectual discourse”
Libraries are not archives or museums Discovery, access, reach and accessibility Low latency
History Custodians of knowledge Content generator as well as repository The “scholar librarian”
Nature of intellectual discourse Artefacts/mechanisms of discourse are changing...quickly
Books, letters, papers, websites, social media, linked-data
Reduced turnaround times
Desire for currency of content shortens “publication” cycle
Rise of interdisciplinary/collaborative projects
Reduce silo-ization
Broad discovery and linking
What about “subject specialists”?
Interoperability and openness
Cash
Impact of technology Shift to born-digital content (and surrogates) Publication not necessarily the primary outcome
Online resources, communities, networks No “thing” that can be deposited in a library
Internet technologies and standards encroach on many traditional library and academic areas Often more pragmatic, better tested and better supported Open Source and Open Standards
RDF, Semantic Web and linked data Access/reuse of content is a key value proposition Flexibility and extensibility are very useful features
Decreasing unit costs of processing and storage Capacity to create vs capacity to preserve
The internet achieves a high rate of change With engagement by users
Custodians of “knowledge” Knowledge ~ truth (dynamic approximation thereof) An artefact derives much of its meaning from attributes that are not intrinsic to
the artefact itself Context - the circumstances under which it was created Provenance - the route by which it came to be where it is now
This is especially true for digital materials A file is a meaningless stream of bytes The name can be readily changed – it is not intrinsic The file format is not intrinsic – text vs XML vs HTML vs TeX
The “metadata” alone can have more meaning than the “data” alone Can we even unambiguously define metadata?
Image vs transcription vs abstract vs description
A digital object should be considered a greater whole comprising several streams of information that can be arbitrarily labelled data or metadata but all of which contribute to the intellectual content of the object
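The idea of a digital object as a greater whole of labelled streams can be sketched minimally. Assuming Python; every stream name and content below is a hypothetical placeholder, not a Bodleian data structure:

```python
# A digital object sketched as a bundle of labelled streams, any of which may
# be called "data" or "metadata" depending on viewpoint. All names and
# contents here are hypothetical placeholders.
digital_object = {
    "id": "obj:1",
    "streams": {
        "image/tiff": b"\x49\x49\x2a\x00",   # master image bytes (stub header)
        "text/ocr":   "Alice was beginning to get very tired...",
        "mods.xml":   "<mods>...</mods>",     # bibliographic description
        "premis.xml": "<premis>...</premis>", # preservation metadata
    },
}

# Every stream contributes to the intellectual content of the object:
for name in digital_object["streams"]:
    print(name)
```

The point of the sketch is that no stream is privileged: the OCR text, the bibliographic record and the preservation record are peers of the image bytes.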
Why libraries? Libraries still hold much original physical and digital
material…
Expertise in long term retention business
Privileged access to the primary sources of new knowledge
To succeed they must engage with the wider information community – use their standards and language, and interoperate.
…and learn how to market themselves!
Dean B. Krafft, Cornell University Library
Tom Cramer, Stanford University Libraries
Linked Data for Libraries (LD4L) Andrew W. Mellon Foundation made a two-year $999K grant to Cornell,
Harvard, and Stanford Develop an ontology and linked data sources that provide relationships,
metadata, and broad context for Scholarly Information Resources Leverages existing work by the VIVO project and the Hydra Partnership
Vision: Create a LOD standard to exchange all that libraries know about their resources
“The goal is to create a Scholarly Resource Semantic Information Store model that works both within individual
institutions and through a coordinated, extensible network of Linked Open Data to capture the intellectual value that librarians and other domain experts add to information
resources when they describe, annotate, organize, select, and use those resources, together with the social value evident
from patterns of usage.”
What is VIVO? Software: An open-source semantic-web-based researcher and research
discovery tool Data: Institution-wide, publicly-visible information about research and
researchers Human and machine readable Search and query
Standards: A standard ontology (VIVO data) that interconnects researchers, communities, and campuses using Linked Open Data Relationships – augment queries
e.g.: All researchers involved with any gene associated with breast cancer (through research project, publication, etc.)
Simple reasoning to categorise and find associations e.g: Teaching faculty = any faculty member teaching a course
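The relationship-augmented query above ("all researchers involved with any gene associated with breast cancer") can be sketched as a traversal over an in-memory triple set. Assuming Python; the data and predicate names are hypothetical illustrations, not the VIVO ontology or API:

```python
# Minimal in-memory triple store (subject, predicate, object).
# All identifiers and predicates are hypothetical, not VIVO terms.
triples = [
    ("gene:BRCA1",   "associatedWith", "disease:BreastCancer"),
    ("gene:TP53",    "associatedWith", "disease:BreastCancer"),
    ("paper:p1",     "mentionsGene",   "gene:BRCA1"),
    ("person:alice", "authorOf",       "paper:p1"),
    ("project:g1",   "studiesGene",    "gene:TP53"),
    ("person:bob",   "memberOf",       "project:g1"),
]

def subjects(p, o):
    """All subjects s such that (s, p, o) is asserted."""
    return {s for (s, p2, o2) in triples if p2 == p and o2 == o}

# 1. Genes associated with the disease.
genes = subjects("associatedWith", "disease:BreastCancer")

# 2. Researchers linked to those genes via a publication or a project.
researchers = set()
for g in genes:
    for paper in subjects("mentionsGene", g):
        researchers |= subjects("authorOf", paper)
    for proj in subjects("studiesGene", g):
        researchers |= subjects("memberOf", proj)

print(sorted(researchers))  # ['person:alice', 'person:bob']
```

In a real deployment this traversal would be a SPARQL query against the VIVO triple store rather than Python loops, but the shape of the reasoning is the same.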
Community: An open community with strong national and international participation
VIVO connects scientists and scholars with and through their research and scholarship
How LD4L builds on VIVO LD4L brings the relationship and identifier-based
architecture of VIVO to mainstream library use cases and applications
The LD4L ontology will draw on VIVO-ISF ontology design patterns (among others)
Vitro as a semantic web browser will play a role in LD4L infrastructure along with more specialized, purpose-built tools
The multi-institution LD4L demonstration search will be an adaptation of VIVOsearch.org
LD4L will link to existing VIVO data in addition to Harvard Faculty Finder and Stanford CAP data
BIBFRAME The Library of Congress (LoC) has developed the
BIBFRAME ontology as an (eventual) replacement for MARC, the current cataloging standard for library resources The Library of Congress BIBFRAME initiative “provides a
foundation for the future of bibliographic description, both on the web, and in the broader networked world.”
Both LoC and Zepheira, a contractor, have developed converters that produce BIBFRAME RDF from MARC XML We are building from existing library catalog data in MARC
and want to mainstream the use of identifiers and linked data in our library workflows
BIBFRAME Basic Concepts (Creative) Work - conceptual
essence of a cataloguing resource Instance – individual material
embodiment of a Work Authority – key authority concepts
that have defined relationships reflected in a Work or Instance. Examples are: People, Places, Organisations, Topics etc. Domain – entity taking responsibility for the recognition, organisation and maintenance of the authoritative resources
Annotation – enhances our knowledge about another resource. Source of the annotation is critical. Examples are: Library Holdings, Cover Art, Reviews etc.
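The four BIBFRAME concepts above can be sketched as a handful of triples. Assuming Python; the `bf:` property names and all URIs are illustrative stand-ins, not the exact BIBFRAME vocabulary:

```python
# Work / Instance / Authority / Annotation as subject-predicate-object triples.
# Property names and URIs are illustrative, not the exact BIBFRAME terms.
work     = "http://example.org/work/alice-in-wonderland"    # (Creative) Work
instance = "http://example.org/instance/macmillan-1865"     # Instance
author   = "http://viaf.org/viaf/0000000"                   # Authority (placeholder VIAF-style URI)
review   = "http://example.org/annotation/review-1"         # Annotation

triples = [
    (work,     "bf:creator",          author),      # Work links to an Authority
    (instance, "bf:instanceOf",       work),        # Instance embodies the Work
    (instance, "bf:publisher",        "Macmillan"),
    (review,   "bf:annotates",        instance),    # Annotation enhances knowledge
    (review,   "bf:annotationSource", "http://example.org/org/some-library"),
]
```

Note that the source of the annotation is itself a triple: as the slide says, who made the annotation is critical.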
Adding identifiers Translating MARC into RDF does not make useful linked data
Identifiers (URIs) are essential
Local identifiers for statements made by an institution, both local authority information and annotation
Global identifiers for people, organizations, and places where they can be reliably determined OCLC Work URIs for shared works across institutions VIAF, ORCID for people Evaluating DBpedia for place linkages
Goal: From strings to things People, Organizations, Places, Subjects, Events, Works,
Datasets
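The "strings to things" step can be sketched as replacing literal name strings from MARC-derived records with stable URIs. Assuming Python; the lookup table is a stand-in for real reconciliation services (VIAF, ORCID, OCLC) that a production workflow would query, and the VIAF id shown is hypothetical:

```python
# "Strings to things": swap literal name strings for URIs. The dict below is
# a stand-in for real reconciliation against VIAF/ORCID; the id is hypothetical.
authority_lookup = {
    "Carroll, Lewis, 1832-1898": "http://viaf.org/viaf/0000000",
}

record = {"title": "Alice's Adventures in Wonderland",
          "author": "Carroll, Lewis, 1832-1898"}

def reconcile(rec):
    """Replace the author string with a global URI when a match exists;
    otherwise mint a local identifier so every statement still has a URI."""
    uri = authority_lookup.get(rec["author"])
    if uri is None:
        uri = "http://example.org/local/person/" + rec["author"].replace(" ", "-")
    return {**rec, "author": uri}

linked = reconcile(record)
print(linked["author"])
```

Minting a local URI on a miss matters: local statements stay linkable, and the local identifier can later be declared `owl:sameAs` a global one once reconciliation succeeds.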
Leveraging OCLC work identifiers OCLC WorldCat functions as a union catalog of bibliographic
identifiers shared across institutions
Goal is to reference common OCLC Work URIs in bibliographic resources from Cornell, Harvard, and Stanford to support common search discovery and interlinking Harvard: 82% of 13.6 million bib records can be matched to OCLC
Work identifiers
Stanford and Cornell have ~2.6 million records in common out of ~5.8 million in each collection
Annotations and usage information can then be compared across 3 institutions
If it can work for 3, it can work for many
Likely components of the LD4L ontology Library resources: BIBFRAME
Additional bibliographic types and partonomy relationships: FaBiO
People/Organizations: VIVO-ISF (includes FOAF)
Annotations: OpenAnnotation
Provenance: PAV
Virtual Collections and Structured Relationships: OAI-ORE
Concepts: SKOS (or vocabularies such as Getty with stable URIs)
Many identifiers: VIAF, ORCID, ISNI, OCLC Works
Entity Reconciliation Locally critical to link information across library system
silos
Essential to link across the three partners to support discovery, annotation, virtual collections: works, people, places, subjects, etc.
Linking to web of LOD surfaces new relationships and networks
Library role: expose our own unique entities and connect them to the rest of the world
The more we can link, the more we can discover
How will LD4L make these connections? By using ontologies commonly found in linked data
By connecting with Cornell VIVO/Stanford CAP/Harvard Profiles information
By using persistent, stable local identifiers (URIs)
By linking stable local identifiers to global identifiers (ORCID, VIAF, ISNI)
By supporting annotations with provenance
By linking to external sources of networked relationships: DBpedia, IMDb, OCLC
LD4L Data Sources
LD4L Approach
What Is Hydra? A robust repository fronted
by feature-rich, tailored applications and workflows (“heads”) One body, many heads
Collaboratively built “solution bundles” that can be adapted and modified to suit local needs.
A community of developers and adopters extending and enhancing the core If you want to go fast, go
alone. If you want to go far, go together.
Fedora provides a durable repository layer to support object management and persistence
Solr provides fast access to indexed information
Blacklight, a Ruby on Rails plugin that sits atop Solr and provides faceted search & tailored views on objects
Hydra-Head, a Ruby on Rails plugin that provides create, update and delete actions against Fedora objects
How LD4L builds on Hydra We will augment the ActiveTriples gem to mimic
ActiveFedora
We will write code to store Open Annotations (OA) linked data in Fedora 4, natively
We will use Blacklight as a UI for making/viewing annotations, and for searching data indexed from LD4L triple stores
We will leverage the Questioning Authority Gem for Use Case 3.4: LOD-based Data Entry
Timeline so far Jan-June 2014: Initial ontology design; identify data sources; identify
external vocabularies; begin LD4L and Hydra ActiveTriples development
July-Dec 2014: Complete initial ontology; complete initial ActiveTriples development; pilot initial data ingests into Vitro-based LD4L instance at Cornell
February 2015: Hold a two-day by invitation workshop for 25 attendees from 10-12 interested library, archive, and cultural memory institutions Demonstrate initial prototypes of LD4L and ontology Obtain feedback on initial ontology design Obtain feedback on overall design and approach Make connections to support participants in piloting this approach at their
institutions Understand how institutions see this approach fitting in with their own multi-
institutional collaborations and existing cross-institutional efforts such as the Digital Public Library of America, VIVO, and SHARE
Timeline 2015 Jan-June 2015
Pilot LD4L instances at Harvard and Stanford Populate Cornell LD4L instance from multiple data sources
including MARC catalog records, EAD finding aids, VIVO data, CuLLR, and local digital collections
Develop a test instance of the LD4L Search application harvesting RDF across the three partner institutions
Integrate LD4L with ActiveTriples
July-Dec 2015 Implement fully functional LD4L instances at Cornell, Harvard, and
Stanford Public release of open source LD4L code and ontology Public release of open source ActiveTriples Hydra Component Create public demonstration of LD4L Search-based discovery and
access system across the three LD4L instances
LD4L Project Outcomes Open source extensible LD4L ontology compatible
with VIVO ontology, BIBFRAME, and other existing library LOD efforts
Open source LD4L semantic editing, display, and discovery system
Project Hydra compatible interface to LD4L, using ActiveTriples to support Blacklight search across multiple LD4L instances
Neil Jefferies, Bodleian Libraries, University of Oxford
Tanya Gray Jones, Bodleian Libraries, University of Oxford
Examining our digital collections Bodleian has around 150 online digital collections
Images, typography, data, digitised books, texts, etc. Many are structured and are the result of scholarly rather than library projects
Common objects reappear in many places: Items – Works, (Manifestations) Artefacts, Components Labels – Classifications, Vocabularies, Ontologies, Names, Attribute
values Sort and group items These are vital for discovery (not everything is full-text indexable)
Context – Places, People, Geopolitical entities, Collections Locate items
It is *possible* for something to be more than one type of object Fictitious creations, automata
Is it possible to construct a data model that can capture everything that we have already, let alone consider what
we might get in the future?
Important Considerations The Model should fit the Knowledge
If you are working hard to make your information fit then you are using the wrong approach
Don’t sacrifice accuracy for conformance Standards have implicit biases and assumptions
Affects the types of question that can be asked or answered
Efficiency matters!
Preservation Economics of re-use File format choice “Significant properties” Metadata is critical*
Re-use Final format vs continued use Cannot anticipate how Most potential users are not yet born
No need for a single approach Standards suffer from scope creep
Handle their initial design targets well …and everything else rather less so
Sooner or later your information will become graph-like
RDF types relationships, unlike an RDBMS
RDF (like many standards) can technically encode almost anything but…
Different knowledge types are best treated differently
Mashing it all together is confusing and reduces reusability It is also inefficient
There are existing standards (W3C/IETF > DH > Library)
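The point that RDF types its relationships, unlike an RDBMS, can be shown side by side. Assuming Python; the table shape, URIs and property names are illustrative:

```python
# RDBMS-style row: the relationship lives in the schema. The meaning of the
# 'author_id' foreign key is documented elsewhere, not in the data itself.
book_row = {"id": 17, "title": "Alice's Adventures in Wonderland", "author_id": 3}

# RDF-style triples: the predicate *is* the relationship type, and it can
# itself be described with further triples. Terms here are illustrative.
triples = [
    ("ex:book17", "dcterms:creator", "ex:person3"),
    # The relationship itself is a first-class, queryable resource:
    ("dcterms:creator", "rdfs:label", "Creator"),
    ("dcterms:creator", "rdfs:range", "dcterms:Agent"),
]
```

This is why graph-shaped information outgrows relational schemas: new relationship types are just new predicates, not schema migrations.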
Author
vCard
Digitised Book
MODS (Bibliographic)
PREMIS (Preservation)
CC-BY-SA (Rights)
Text (Abstract)
Images
EXIF (photographic)
ALTO (text coordinates)
Text (OCR Output)
JPEG (Image)
TIFF (Image)
Data and Metadata Questions?
Context
Provenance
Evidence
Qualification
Original Context The original context in which the
artefact was created Current context is the product of
provenance
Who created it? Author, illustrator, scribe,
typesetter, printer, publisher?
Why did they create it? How did they create it? Where and when did they create
it? What was going on when they
created it?
Context
Gives meaning to…
Artefect
Provides evidence for
Context is shared
• The Paradise of Dainty Devises
• ???
• Chemistry of Insulin: determination of the structure of insulin opens the way to greater understanding of life processes
• Nucleotide sequence of 5S-ribosomal RNA from Escherichia coli
• The Hitchhiker's Guide to the Galaxy
• The Restaurant at the End of the Universe
• So Long and Thanks for all the Fish
Provenance How an artefact came to be where and how it is now
How a digital surrogate was created/curated etc.
Digital and physical in parallel Conservation and preservation applies to both
The basic questions are framed in similar terms to original context but with an emphasis on Time and Process The original context is just the early part of provenance!
<HTML><HEAD><Title>Alice's Adventures in Wonderland --
Chapter I</Title></HEAD><BODY>
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book
her sister...
Provenance/Context Models Key components:
Objects (entities, things…) Events located in space and time Agents: Create/change other entities/relationships Items: artefacts, people, places Labels: Classifications, ontologies, ideas, bibliographic works Groups: Organisations, collections (geopolitical constructs)
Relationships (typed)
CIDOC-CRM – www.cidoc-crm.org – ISO 21127 UNESCO/International Council of Museums
Schema.org – www.schema.org (roles and events recently added!) Google, Bing!, Yahoo etc.
TEI – www.tei-c.org PREMIS – www.loc.gov/standards/premis
Preservation metadata originally…(3.X is a significant revision)
Agent – participates in → Event – which changes → Item
Evidence Data models are about assertions *NOT* truth or
reality! Provenance of assertions about objects matters – this
is a key mechanism of scholarship: Who made the assertion? When? On what basis?
Assertions may be multiple or contradictory Some use cases attempt to compute confidence or
probability values (!)
In practice… This can be and is ignored for some cases (intrinsic
properties of an object) This is often the starting point for further research
(library catalogue, pre-existing data)
Expressing Evidence Most evidence can be accommodated by adopting an event-oriented
expression of information The mechanism used for expressing context and provenance also works
here
Manuscript
• Author
• AuthorBirthDate
• PlaceofCreation
• DateofCreation
• EvidenceForAuthorPlaceDateOfCreation
• Title
• Abstract
• CurrentLocation
• DateOfDepositAtCurrentLocation
• EvidenceForDateOfDepositAtCurrentLocation
Deposit Event
Time Place Evidence
Manuscript
Title Abstract
Creation Event
Time Place Evidence
Author
BirthDate BirthPlace
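The contrast between the flat record and the event-oriented model above can be sketched in a few lines. Assuming Python; all identifiers, field names and evidence strings are hypothetical illustrations:

```python
# Event-oriented sketch of the manuscript example: instead of flat fields
# (DateofCreation, EvidenceFor...), each assertion hangs off an event that
# carries its own time, place and evidence. All values are hypothetical.
manuscript = {"id": "ms:1", "title": "A manuscript", "abstract": "..."}

events = [
    {"type": "Creation",
     "object": "ms:1",
     "agent": "person:author1",      # links out to an Author entity
     "time": "1450?",                # uncertainty stays with the event
     "place": "place:florence",
     "evidence": "Colophon on fol. 2r"},
    {"type": "Deposit",
     "object": "ms:1",
     "time": "1602",
     "place": "place:bodleian",
     "evidence": "Benefactors' register entry"},
]

def assertions_about(obj_id):
    """All event-backed statements about an object, each with its evidence."""
    return [e for e in events if e["object"] == obj_id]

for e in assertions_about("ms:1"):
    print(e["type"], e["time"], "-", e["evidence"])
```

Because each event carries its own evidence, contradictory assertions (two claimed creation dates, say) can coexist as two events rather than fighting over one field.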
Another Viewpoint We can reframe the previous discussion in terms of a general
need to be able to qualify an assertion in terms of one or more of: Time – When an assertion is true
An obvious case, the existence of a person
Place – Where an assertion is true Professor of History at Oxford <> Professor of History at Heidelberg Places can be geopolitical entities such as jurisdictions
Which are themselves time dependent
Source – Who made the assertion An anonymous text is a valid source though
Evidence - Why the assertion has been made …and counter-evidence too
Confidence – How much can the assertion be trusted Often depends on the source and evidence
Different Knowledge Types
Derived Knowledge
• History
Meaning
• Relate Semantic Elements to other objects
• Who, When, Where
• Iconography
Context
• Immediate information available from the object environment
• Metadata!
• Creator, Location in Library, Accession
• Documented Provenance
Semantic Elements
• Meaningful chunks of content
• Titles/Subtitles, Personal Names, Place Names, Contents Lists, Indices, Dates
• Image Components
Intrinsic Information
• Raw information content
• Raw Text, Lines, Headers, Pagination, Images
• “Text Coordinates”
Physical Attributes
• Material, Page Size, Font, Colour
Increasing Uncertainty – Increasing Need for Qualification
The PROV Ontology W3C standard http://www.w3.org/TR/prov-o
Relationship in Context The basic Prov-O relationships are rather generic so they need to
be qualified Roles define how an entity relates to an activity
Entity includes agents
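The qualification pattern can be shown as plain triples. Assuming Python; the `prov:` terms below are real PROV-O vocabulary, while everything after `ex:` is a hypothetical example:

```python
# PROV-O "qualified association": instead of the direct generic link, an
# intermediate Association node carries the agent's role. The prov: terms
# are PROV-O vocabulary; ex: identifiers are hypothetical.
triples = [
    # Unqualified (generic) form – agent linked to activity, role unknown:
    ("ex:digitisation", "prov:wasAssociatedWith",    "ex:tanya"),
    # Qualified form – adds a node that can hold the role:
    ("ex:digitisation", "prov:qualifiedAssociation", "ex:assoc1"),
    ("ex:assoc1",       "rdf:type",                  "prov:Association"),
    ("ex:assoc1",       "prov:agent",                "ex:tanya"),
    ("ex:assoc1",       "prov:hadRole",              "ex:photographer"),
]
```

The same indirection works for the other generic PROV relationships: whenever a link needs a role, a time or other context, reify it into a qualified node.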
Start Modelling… I have not discussed how data is actually captured and stored – this is intentional and should not be considered until you…
Understand what information you have
Understand what questions you want to answer
Understand what tools you have available
Understand what additional information you need to acquire
The data modelling process will help with some of these (to some extent – no promises) RDF can be represented in many different
ways Non-RDF data
Where possible, consider expressing your outputs in a similar manner – this will enrich the basic dataset and allow further development
(It’s only a model)
http://camelot-dev.bodleian.ox.ac.uk/
Linked Data is key It is a Web standard rather than a library standard
Usage, tools and expertise
Flexible and extensible to adapt to future needs
Common ontologies and mappings are essential
It defines a clear mechanism for interoperability
Reconciliation of shared entities is essential
It is machine readable and amenable to machine reasoning
Data enrichment and analytics
Emergent Data Model The objects we must store, curate and preserve go far
beyond Works and Instances
Agents: People, Organisations, Automata (software and instruments)
Coordinates: Events, Locations (Physical and geopolitical)
Classification: Keywords, Vocabularies, Subject headings
Structural: Components, Fragments
The relationships between these objects are rich, and vary with time and context
All of these objects have their own provenance and history
Libraries must be proactive Conflation of source data, research environment and
publication platform
Information no longer automatically comes to libraries
Librarians need to understand data and linked data in their domain
Interdisciplinary research makes this challenging
Libraries have marketable strengths
Search and analytics
Long term retention