Neil Jefferies
Bodleian Libraries
Structure Analysis
Deconstructing Digital Libraries
Contextual Data Models
Two projects
Linked Data for Libraries (LD4L)
CAMELOT
Conclusions
Neil Jefferies, Bodleian Libraries, University of Oxford
Deconstructing digital libraries What are the salient features of the content?
Tablets? Scrolls? Books? Journals? Images? Datasets?
Ideas and the development of ideas Data vs Facts vs Information vs Knowledge The artefacts of “intellectual discourse”
Libraries are not archives or museums Discovery, access, reach and accessibility Low latency
History Custodians of knowledge Content generator as well as repository The “scholar librarian”
Nature of intellectual discourse Artefacts/mechanisms of discourse are changing...quickly
Books, letters, papers, websites, social media, linked-data
Reduced turnaround times
Desire for currency of content shortens “publication” cycle
Rise of interdisciplinary/collaborative projects
Reduce silo-ization
Broad discovery and linking
What about “subject specialists”?
Interoperability and openness
Cash
Impact of technology Shift to born-digital content (and surrogates) Publication not necessarily the primary outcome
Online resources, communities, networks No “thing” that can be deposited in a library
Internet technologies and standards encroach on many traditional library and academic areas Often more pragmatic, better tested and better supported Open Source and Open Standards
RDF, Semantic Web and linked data Access/reuse of content is a key value proposition Flexibility and extensibility are very useful features
Decreasing unit costs of processing and storage Capacity to create vs capacity to preserve
The internet achieves a high rate of change With engagement by users
Custodians of “knowledge” Knowledge ~ truth (dynamic approximation thereof) An artefact derives much of its meaning from attributes that are not intrinsic to
the artefact itself Context - the circumstances under which it was created Provenance - the route by which it came to be where it is now
This is especially true for digital materials A file is a meaningless stream of bytes The name can be readily changed – it is not intrinsic The file format is not intrinsic – text vs XML vs HTML vs TeX
The “metadata” alone can have more meaning than the “data” alone Can we even unambiguously define metadata?
Image vs transcription vs abstract vs description
A digital object should be considered a greater whole comprising several streams of information that can be arbitrarily labelled data or metadata but all of which contribute to the intellectual content of the object
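The idea of a digital object as a greater whole of labelled streams can be sketched minimally. Assuming Python; every stream name and content below is a hypothetical placeholder, not a Bodleian data structure:

```python
# A digital object sketched as a bundle of labelled streams, any of which may
# be called "data" or "metadata" depending on viewpoint. All names and
# contents here are hypothetical placeholders.
digital_object = {
    "id": "obj:1",
    "streams": {
        "image/tiff": b"\x49\x49\x2a\x00",   # master image bytes (stub header)
        "text/ocr":   "Alice was beginning to get very tired...",
        "mods.xml":   "<mods>...</mods>",     # bibliographic description
        "premis.xml": "<premis>...</premis>", # preservation metadata
    },
}

# Every stream contributes to the intellectual content of the object:
for name in digital_object["streams"]:
    print(name)
```

The point of the sketch is that no stream is privileged: the OCR text, the bibliographic record and the preservation record are peers of the image bytes.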
Why libraries? Libraries still hold much original physical and digital
material…
Expertise in long term retention business
Privileged access to the primary sources of new knowledge
To succeed they must engage with the wider information community – use their standards and language, and interoperate.
…and learn how to market themselves!
Dean B. Krafft, Cornell University Library
Tom Cramer, Stanford University Libraries
Linked Data for Libraries (LD4L) Andrew W. Mellon Foundation made a two-year $999K grant to Cornell,
Harvard, and Stanford Develop an ontology and linked data sources that provide relationships,
metadata, and broad context for Scholarly Information Resources Leverages existing work by the VIVO project and the Hydra Partnership
Vision: Create a LOD standard to exchange all that libraries know about their resources
“The goal is to create a Scholarly Resource Semantic Information Store model that works both within individual
institutions and through a coordinated, extensible network of Linked Open Data to capture the intellectual value that librarians and other domain experts add to information
resources when they describe, annotate, organize, select, and use those resources, together with the social value evident
from patterns of usage.”
What is VIVO? Software: An open-source semantic-web-based researcher and research
discovery tool Data: Institution-wide, publicly-visible information about research and
researchers Human and machine readable Search and query
Standards: A standard ontology (VIVO data) that interconnects researchers, communities, and campuses using Linked Open Data Relationships – augment queries
e.g.: All researchers involved with any gene associated with breast cancer (through research project, publication, etc.)
Simple reasoning to categorise and find associations e.g: Teaching faculty = any faculty member teaching a course
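The relationship-augmented query above ("all researchers involved with any gene associated with breast cancer") can be sketched as a traversal over an in-memory triple set. Assuming Python; the data and predicate names are hypothetical illustrations, not the VIVO ontology or API:

```python
# Minimal in-memory triple store (subject, predicate, object).
# All identifiers and predicates are hypothetical, not VIVO terms.
triples = [
    ("gene:BRCA1",   "associatedWith", "disease:BreastCancer"),
    ("gene:TP53",    "associatedWith", "disease:BreastCancer"),
    ("paper:p1",     "mentionsGene",   "gene:BRCA1"),
    ("person:alice", "authorOf",       "paper:p1"),
    ("project:g1",   "studiesGene",    "gene:TP53"),
    ("person:bob",   "memberOf",       "project:g1"),
]

def subjects(p, o):
    """All subjects s such that (s, p, o) is asserted."""
    return {s for (s, p2, o2) in triples if p2 == p and o2 == o}

# 1. Genes associated with the disease.
genes = subjects("associatedWith", "disease:BreastCancer")

# 2. Researchers linked to those genes via a publication or a project.
researchers = set()
for g in genes:
    for paper in subjects("mentionsGene", g):
        researchers |= subjects("authorOf", paper)
    for proj in subjects("studiesGene", g):
        researchers |= subjects("memberOf", proj)

print(sorted(researchers))  # ['person:alice', 'person:bob']
```

In a real deployment this traversal would be a SPARQL query against the VIVO triple store rather than Python loops, but the shape of the reasoning is the same.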
Community: An open community with strong national and international participation
VIVO connects scientists and scholars with and through their research and scholarship
How LD4L builds on VIVO LD4L brings the relationship and identifier-based
architecture of VIVO to mainstream library use cases and applications
The LD4L ontology will draw on VIVO-ISF ontology design patterns (among others)
Vitro as a semantic web browser will play a role in LD4L infrastructure along with more specialized, purpose-built tools
The multi-institution LD4L demonstration search will be an adaptation of VIVOsearch.org
LD4L will link to existing VIVO data in addition to Harvard Faculty Finder and Stanford CAP data
BIBFRAME The Library of Congress (LoC) has developed the
BIBFRAME ontology as an (eventual) replacement for MARC, the current cataloging standard for library resources The Library of Congress BIBFRAME initiative “provides a
foundation for the future of bibliographic description, both on the web, and in the broader networked world.”
Both LoC and Zepheira, a contractor, have developed converters that produce BIBFRAME RDF from MARC XML We are building from existing library catalog data in MARC
and want to mainstream the use of identifiers and linked data in our library workflows
BIBFRAME Basic Concepts (Creative) Work - conceptual
essence of a cataloguing resource Instance – individual material
embodiment of a Work Authority – key authority concepts
that have defined relationships reflected in a Work or Instance. Examples are: People, Places, Organisations, Topics etc. Domain – entity taking responsibility for the recognition, organisation and maintenance of the authoritative resources
Annotation – enhances our knowledge about another resource. Source of the annotation is critical. Examples are: Library Holdings, Cover Art, Reviews etc.
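The four BIBFRAME concepts above can be sketched as a handful of triples. Assuming Python; the `bf:` property names and all URIs are illustrative stand-ins, not the exact BIBFRAME vocabulary:

```python
# Work / Instance / Authority / Annotation as subject-predicate-object triples.
# Property names and URIs are illustrative, not the exact BIBFRAME terms.
work     = "http://example.org/work/alice-in-wonderland"    # (Creative) Work
instance = "http://example.org/instance/macmillan-1865"     # Instance
author   = "http://viaf.org/viaf/0000000"                   # Authority (placeholder VIAF-style URI)
review   = "http://example.org/annotation/review-1"         # Annotation

triples = [
    (work,     "bf:creator",          author),      # Work links to an Authority
    (instance, "bf:instanceOf",       work),        # Instance embodies the Work
    (instance, "bf:publisher",        "Macmillan"),
    (review,   "bf:annotates",        instance),    # Annotation enhances knowledge
    (review,   "bf:annotationSource", "http://example.org/org/some-library"),
]
```

Note that the source of the annotation is itself a triple: as the slide says, who made the annotation is critical.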
Adding identifiers Translating MARC into RDF does not make useful linked data
Identifiers (URIs) are essential
Local identifiers for statements made by an institution, both local authority information and annotation
Global identifiers for people, organizations, and places where they can be reliably determined OCLC Work URIs for shared works across institutions VIAF, ORCID for people Evaluating DBpedia for place linkages
Goal: From strings to things People, Organizations, Places, Subjects, Events, Works,
Datasets
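The "strings to things" step can be sketched as replacing literal name strings from MARC-derived records with stable URIs. Assuming Python; the lookup table is a stand-in for real reconciliation services (VIAF, ORCID, OCLC) that a production workflow would query, and the VIAF id shown is hypothetical:

```python
# "Strings to things": swap literal name strings for URIs. The dict below is
# a stand-in for real reconciliation against VIAF/ORCID; the id is hypothetical.
authority_lookup = {
    "Carroll, Lewis, 1832-1898": "http://viaf.org/viaf/0000000",
}

record = {"title": "Alice's Adventures in Wonderland",
          "author": "Carroll, Lewis, 1832-1898"}

def reconcile(rec):
    """Replace the author string with a global URI when a match exists;
    otherwise mint a local identifier so every statement still has a URI."""
    uri = authority_lookup.get(rec["author"])
    if uri is None:
        uri = "http://example.org/local/person/" + rec["author"].replace(" ", "-")
    return {**rec, "author": uri}

linked = reconcile(record)
print(linked["author"])
```

Minting a local URI on a miss matters: local statements stay linkable, and the local identifier can later be declared `owl:sameAs` a global one once reconciliation succeeds.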
Leveraging OCLC work identifiers OCLC WorldCat functions as a union catalog of bibliographic
identifiers shared across institutions
Goal is to reference common OCLC Work URIs in bibliographic resources from Cornell, Harvard, and Stanford to support common search discovery and interlinking Harvard: 82% of 13.6 million bib records can be matched to OCLC
Work identifiers
Stanford and Cornell have ~2.6 million records in common out of ~5.8 million in each collection
Annotations and usage information can then be compared across 3 institutions
If it can work for 3, it can work for many
Likely components of the LD4L ontology Library resources: BIBFRAME
Additional bibliographic types and partonomy relationships: FaBiO
People/Organizations: VIVO-ISF (includes FOAF)
Annotations: OpenAnnotation
Provenance: PAV
Virtual Collections and Structured Relationships: OAI-ORE
Concepts: SKOS (or vocabularies such as Getty with stable URIs)
Many identifiers: VIAF, ORCID, ISNI, OCLC Works
Entity Reconciliation Locally critical to link information across library system
silos
Essential to link across the three partners to support discovery, annotation, virtual collections: works, people, places, subjects, etc.
Linking to web of LOD surfaces new relationships and networks
Library role: expose our own unique entities and connect them to the rest of the world
The more we can link, the more we can discover
How will LD4L make these connections? By using ontologies commonly found in linked data
By connecting with Cornell VIVO/Stanford CAP/Harvard Profiles information
By using persistent, stable local identifiers (URIs)
By linking stable local identifiers to global identifiers (ORCID, VIAF, ISNI)
By supporting annotations with provenance
By linking to external sources of networked relationships: DBpedia, IMDb, OCLC
LD4L Data Sources
LD4L Approach
What Is Hydra? A robust repository fronted
by feature-rich, tailored applications and workflows (“heads”) One body, many heads
Collaboratively built “solution bundles” that can be adapted and modified to suit local needs.
A community of developers and adopters extending and enhancing the core If you want to go fast, go
alone. If you want to go far, go together.
Fedora provides a durable repository layer to support object management and persistence
Solr provides fast access to indexed information
Blacklight, a Ruby on Rails plugin that sits atop Solr and provides faceted search & tailored views on objects
Hydra-Head, a Ruby on Rails plugin that provides create, update and delete actions against Fedora objects
How LD4L builds on Hydra We will augment the ActiveTriples gem to mimic
ActiveFedora
We will write code to store Open Annotations (OA) linked data in Fedora 4, natively
We will use Blacklight as a UI for making/viewing annotations, and for searching data indexed from LD4L triple stores
We will leverage the Questioning Authority Gem for Use Case 3.4: LOD-based Data Entry
Timeline so far Jan-June 2014: Initial ontology design; identify data sources; identify
external vocabularies; begin LD4L and Hydra ActiveTriples development
July-Dec 2014: Complete initial ontology; complete initial ActiveTriples development; pilot initial data ingests into Vitro-based LD4L instance at Cornell
February 2015: Hold a two-day by invitation workshop for 25 attendees from 10-12 interested library, archive, and cultural memory institutions Demonstrate initial prototypes of LD4L and ontology Obtain feedback on initial ontology design Obtain feedback on overall design and approach Make connections to support participants in piloting this approach at their
institutions Understand how institutions see this approach fitting in with their own multi-
institutional collaborations and existing cross-institutional efforts such as the Digital Public Library of America, VIVO, and SHARE
Timeline 2015 Jan-June 2015
Pilot LD4L instances at Harvard and Stanford Populate Cornell LD4L instance from multiple data sources
including MARC catalog records, EAD finding aids, VIVO data, CuLLR, and local digital collections
Develop a test instance of the LD4L Search application harvesting RDF across the three partner institutions
Integrate LD4L with ActiveTriples
July-Dec 2015 Implement fully functional LD4L instances at Cornell, Harvard, and
Stanford Public release of open source LD4L code and ontology Public release of open source ActiveTriples Hydra Component Create public demonstration of LD4L Search-based discovery and
access system across the three LD4L instances
LD4L Project Outcomes Open source extensible LD4L ontology compatible
with VIVO ontology, BIBFRAME, and other existing library LOD efforts
Open source LD4L semantic editing, display, and discovery system
Project Hydra compatible interface to LD4L, using ActiveTriples to support Blacklight search across multiple LD4L instances
Neil Jefferies, Bodleian Libraries, University of Oxford
Tanya Gray Jones, Bodleian Libraries, University of Oxford
Examining our digital collections Bodleian has around 150 online digital collections
Images, typography, data, digitised books, texts, etc. Many are structured and are the result of scholarly rather than library projects
Common objects reappear in many places: Items – Works, (Manifestations) Artefacts, Components Labels – Classifications, Vocabularies, Ontologies, Names, Attribute
values Sort and group items These are vital for discovery (not everything is full-text indexable)
Context – Places, People, Geopolitical entities, Collections Locate items
It is *possible* for something to be more than one type of object Fictitious creations, automata
Is it possible to construct a data model that can capture everything that we have already, let alone consider what
we might get in the future?
Important Considerations The Model should fit the Knowledge
If you are working hard to make your information fit then you are using the wrong approach
Don’t sacrifice accuracy for conformance Standards have implicit biases and assumptions
Affects the types of question that can be asked or answered
Efficiency matters!
Preservation Economics of re-use File format choice “Significant properties” Metadata is critical*
Re-use Final format vs continued use Cannot anticipate how Most potential users are not yet born
No need for a single approach Standards suffer from scope creep
Handle their initial design targets well …and everything else rather less so
Sooner or later your information will become graph-like
RDF types relationships, unlike an RDBMS
RDF (like many standards) can technically encode almost anything but…
Different knowledge types are best treated differently
Mashing it all together is confusing and reduces reusability It is also inefficient
There are existing standards (W3C/IETF > DH > Library)
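The point that RDF types its relationships, unlike an RDBMS, can be shown side by side. Assuming Python; the table shape, URIs and property names are illustrative:

```python
# RDBMS-style row: the relationship lives in the schema. The meaning of the
# 'author_id' foreign key is documented elsewhere, not in the data itself.
book_row = {"id": 17, "title": "Alice's Adventures in Wonderland", "author_id": 3}

# RDF-style triples: the predicate *is* the relationship type, and it can
# itself be described with further triples. Terms here are illustrative.
triples = [
    ("ex:book17", "dcterms:creator", "ex:person3"),
    # The relationship itself is a first-class, queryable resource:
    ("dcterms:creator", "rdfs:label", "Creator"),
    ("dcterms:creator", "rdfs:range", "dcterms:Agent"),
]
```

This is why graph-shaped information outgrows relational schemas: new relationship types are just new predicates, not schema migrations.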
Author
vCard
Digitised Book
MODS (Bibliographic)
PREMIS (Preservation)
CC-BY-SA (Rights)
Text (Abstract)
Images
EXIF (photographic)
ALTO (text coordinates)
Text (OCR Output)
JPEG (Image)
TIFF (Image)
Data and Metadata Questions?
Context
Provenance
Evidence
Qualification
Original Context The original context in which the
artefact was created Current context is the product of
provenance
Who created it? Author, illustrator, scribe,
typesetter, printer, publisher?
Why did they create it? How did they create it? Where and when did they create
it? What was going on when they
created it?
Context
Gives meaning to…
Artefect
Provides evidence for
Context is shared
• The Paradise of Dainty Devises
• ???
• Chemistry of Insulin: determination of the structure of insulin opens the way to greater understanding of life processes
• Nucleotide sequence of 5S-ribosomal RNA from Escherichia coli
• The Hitchhiker's Guide to the Galaxy
• The Restaurant at the End of the Universe
• So Long and Thanks for all the Fish
Provenance How an artefact came to be where and how it is now
How a digital surrogate was created/curated etc.
Digital and physical in parallel Conservation and preservation applies to both
The basic questions are framed in similar terms to original context but with an emphasis on Time and Process The original context is just the early part of provenance!
<HTML><HEAD><Title>Alice's Adventures in Wonderland --
Chapter I</Title></HEAD><BODY>
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book
her sister...
Provenance/Context Models Key components:
Objects (entities, things…) Events located in space and time Agents: Create/change other entities/relationships Items: artefacts, people, places Labels: Classifications, ontologies, ideas, bibliographic works Groups: Organisations, collections (geopolitical constructs)
Relationships (typed)
CIDOC-CRM – www.cidoc-crm.org – ISO 21127 UNESCO/International Council of Museums
Schema.org – www.schema.org (roles and events recently added!) Google, Bing!, Yahoo etc.
TEI – www.tei-c.org PREMIS – www.loc.gov/standards/premis
Preservation metadata originally…(3.X is a significant revision)
Agent – participates in → Event – which changes → Item
Evidence Data models are about assertions *NOT* truth or
reality! Provenance of assertions about objects matters – this
is a key mechanism of scholarship: Who made the assertion? When? On what basis?
Assertions may be multiple or contradictory Some use cases attempt to compute confidence or
probability values (!)
In practice… This can be and is ignored for some cases (intrinsic
properties of an object) This is often the starting point for further research
(library catalogue, pre-existing data)
Expressing Evidence Most evidence can be accommodated by adopting an event-oriented
expression of information The mechanism used for expressing context and provenance also works
here
Manuscript
• Author
• AuthorBirthDate
• PlaceofCreation
• DateofCreation
• EvidenceForAuthorPlaceDateOfCreation
• Title
• Abstract
• CurrentLocation
• DateOfDepositAtCurrentLocation
• EvidenceForDateOfDepositAtCurrentLocation
Deposit Event
Time Place Evidence
Manuscript
Title Abstract
Creation Event
Time Place Evidence
Author
BirthDate BirthPlace
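The contrast between the flat record and the event-oriented model above can be sketched in a few lines. Assuming Python; all identifiers, field names and evidence strings are hypothetical illustrations:

```python
# Event-oriented sketch of the manuscript example: instead of flat fields
# (DateofCreation, EvidenceFor...), each assertion hangs off an event that
# carries its own time, place and evidence. All values are hypothetical.
manuscript = {"id": "ms:1", "title": "A manuscript", "abstract": "..."}

events = [
    {"type": "Creation",
     "object": "ms:1",
     "agent": "person:author1",      # links out to an Author entity
     "time": "1450?",                # uncertainty stays with the event
     "place": "place:florence",
     "evidence": "Colophon on fol. 2r"},
    {"type": "Deposit",
     "object": "ms:1",
     "time": "1602",
     "place": "place:bodleian",
     "evidence": "Benefactors' register entry"},
]

def assertions_about(obj_id):
    """All event-backed statements about an object, each with its evidence."""
    return [e for e in events if e["object"] == obj_id]

for e in assertions_about("ms:1"):
    print(e["type"], e["time"], "-", e["evidence"])
```

Because each event carries its own evidence, contradictory assertions (two claimed creation dates, say) can coexist as two events rather than fighting over one field.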
Another Viewpoint We can reframe the previous discussion in terms of a general
need to be able to qualify an assertion in terms of one or more of: Time – When an assertion is true
An obvious case, the existence of a person
Place – Where an assertion is true Professor of History at Oxford <> Professor of History at Heidelberg Places can be geopolitical entities such as jurisdictions
Which are themselves time dependent
Source – Who made the assertion An anonymous text is a valid source though
Evidence - Why the assertion has been made …and counter-evidence too
Confidence – How much can the assertion be trusted Often depends on the source and evidence
Different Knowledge Types
Derived Knowledge
• History
Meaning
• Relate Semantic Elements to other objects
• Who, When, Where
• Iconography
Context
• Immediate information available from the object environment
• Metadata!
• Creator, Location in Library, Accession
• Documented Provenance
Semantic Elements
• Meaningful chunks of content
• Titles/Subtitles, Personal Names, Place Names, Contents Lists, Indices, Dates
• Image Components
Intrinsic Information
• Raw information content
• Raw Text, Lines, Headers, Pagination, Images
• “Text Coordinates”
Physical Attributes
• Material, Page Size, Font, Colour
Increasing Uncertainty – Increasing Need for Qualification
The PROV Ontology W3C standard http://www.w3.org/TR/prov-o
Relationship in Context The basic Prov-O relationships are rather generic so they need to
be qualified Roles define how an entity relates to an activity
Entity includes agents
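The qualification pattern can be shown as plain triples. Assuming Python; the `prov:` terms below are real PROV-O vocabulary, while everything after `ex:` is a hypothetical example:

```python
# PROV-O "qualified association": instead of the direct generic link, an
# intermediate Association node carries the agent's role. The prov: terms
# are PROV-O vocabulary; ex: identifiers are hypothetical.
triples = [
    # Unqualified (generic) form – agent linked to activity, role unknown:
    ("ex:digitisation", "prov:wasAssociatedWith",    "ex:tanya"),
    # Qualified form – adds a node that can hold the role:
    ("ex:digitisation", "prov:qualifiedAssociation", "ex:assoc1"),
    ("ex:assoc1",       "rdf:type",                  "prov:Association"),
    ("ex:assoc1",       "prov:agent",                "ex:tanya"),
    ("ex:assoc1",       "prov:hadRole",              "ex:photographer"),
]
```

The same indirection works for the other generic PROV relationships: whenever a link needs a role, a time or other context, reify it into a qualified node.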
Start Modelling… I have not discussed how data is actually captured and stored – this is intentional and should not be considered until you…
Understand what information you have
Understand what questions you want to answer
Understand what tools you have available
Understand what additional information you need to acquire
The data modelling process will help with some of these (to some extent – no promises) RDF can be represented in many different
ways Non-RDF data
Where possible, consider expressing your outputs in a similar manner – this will enrich the basic dataset and allow further development
(It’s only a model)
http://camelot-dev.bodleian.ox.ac.uk/
Linked Data is key It is a Web standard rather than a library standard
Usage, tools and expertise
Flexible and extensible to adapt to future needs
Common ontologies and mappings are essential
It defines a clear mechanism for interoperability
Reconciliation of shared entities is essential
It is machine readable and amenable to machine reasoning
Data enrichment and analytics
Emergent Data Model The objects we must store, curate and preserve go far
beyond Works and Instances
Agents: People, Organisations, Automata (software and instruments)
Coordinates: Events, Locations (Physical and geopolitical)
Classification: Keywords, Vocabularies, Subject headings
Structural: Components, Fragments
The relationships between these objects are rich, and vary with time and context
All of these objects have their own provenance and history
Libraries must be proactive Conflation of source data, research environment and
publication platform
Information no longer automatically comes to libraries
Librarians need to understand data and linked data in their domain
Interdisciplinary research makes this challenging
Libraries have marketable strengths
Search and analytics
Long term retention