Linked data and voyager

Ed Chamberlain, Systems Development Librarian

Cambridge University Library

Disclaimers

Apologies if you see the semantic web as up there with quantum mechanics …

Will contain some techy stuff

Not that much on Voyager …

Overview

Linked data in theory

What we learnt: IPR, data, supporting technology

How could it be used by Ex Libris?

What is the semantic web?

“The Semantic Web is a "man-made woven web of data" that facilitates machines to understand the semantics, or meaning, of information on the World Wide Web.”

“The concept of Semantic Web applies methods beyond linear presentation of information (Web 1.0) and multi-linear presentation of information (Web 2.0) to make use of hyper-structures leading to entities of hypertext.”

http://en.wikipedia.org/wiki/Semantic_Web

Eh?

Semantic = its meaning is explained - self-describing data!

Hyperlinked = meaning contextualised elsewhere

Focus on machines rather than people

What is Linked Data …

After several iterations of semantic web development …

Tim Berners-Lee has advocated four underlying design principles for linked data:

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things

http://www.w3.org/DesignIssues/LinkedData.html

And RDF?

The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.

http://en.wikipedia.org/wiki/Resource_Description_Framework

What does this mean in practice …

RDF data is expressed as triples:

DC XML …
<dc:identifier>1000346</dc:identifier>
<dc:title>Early medieval history of Kashmir : [with special reference to the Loharas] A.D. 1003-1171</dc:title>

Marc21 …
001 1000346
245 $aEarly medieval history of Kashmir : $b[with special reference to the Loharas] A.D. 1003-1171 /

RDF triples …
<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/title> "Early medieval history of Kashmir : [with special reference to the Loharas] A.D. 1003-1171" .
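A minimal sketch of how one field value becomes one triple, assuming the record is identified by its bib_id and the value is a plain string. The names `ENTRY_BASE`, `escape_literal` and `to_ntriple` are illustrative only and are not taken from the COMET code; only backslashes and double quotes are escaped here.

```python
# Illustrative constants; the URI pattern is copied from the example triple.
ENTRY_BASE = "http://data.lib.cam.ac.uk/id/entry/cambrdgedb_"
DCT_TITLE = "http://purl.org/dc/terms/title"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriple(bib_id, predicate, value):
    """Return one line of the form: <subject> <predicate> "literal" ."""
    subject = ENTRY_BASE + str(bib_id)
    return '<%s> <%s> "%s" .' % (subject, predicate, escape_literal(value))


print(to_ntriple(1000346, DCT_TITLE, "Early medieval history of Kashmir"))
```

The subject and predicate are both URIs, so anyone (or any machine) can look them up; only the object here is a literal string.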

Most of a record …

1. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/title> "Early medieval history of Kashmir : [with special reference to the Loharas] A.D. 1003-1171" .
2. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/type> <http://data.lib.cam.ac.uk/id/type/1cb251ec0d568de6a929b520c4aed8d1> .
3. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/type> <http://data.lib.cam.ac.uk/id/type/46657eb180382684090fda2b5670335d> .
4. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/identifier> "UkCU1000346" .
5. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/issued> "1981" .
6. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/creator> <http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> .
7. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/language> <http://id.loc.gov/vocabulary/iso639-2/eng> .
8. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://RDVocab.info/ElementsplaceOfPublication> <http://id.loc.gov/vocabulary/countries/ii>

Where is the linking exactly?

<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/creator> <http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0>

<http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> <http://www.w3.org/2000/01/rdf-schema#label> "Mohan, Krishna" .
<http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> <http://xmlns.com/foaf/0.1#name> "Mohan, Krishna" .
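The link is simply that the object of one triple is the subject of others. A small sketch of following that hop, assuming the triples are held in memory as Python tuples; the helper names `objects` and `creator_labels` are made up for illustration.

```python
DCT_CREATOR = "http://purl.org/dc/terms/creator"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
ENTRY = "http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346"
PERSON = ("http://data.lib.cam.ac.uk/id/entity/"
          "cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0")

# (subject, predicate, object) tuples copied from the slide.
triples = [
    (ENTRY, DCT_CREATOR, PERSON),
    (PERSON, RDFS_LABEL, "Mohan, Krishna"),
]


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def creator_labels(entry):
    """Hop from an entry to its creator entities, then to their labels."""
    labels = []
    for creator in objects(entry, DCT_CREATOR):
        labels.extend(objects(creator, RDFS_LABEL))
    return labels


print(creator_labels(ENTRY))
```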

External linking

<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/subject> <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> .

<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://www.w3.org/2004/02/skos/core#inScheme> <http://id.loc.gov/authorities#conceptScheme> .
<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://www.w3.org/2004/02/skos/core#prefLabel> "Lohars -- History" .
<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://purl.org/dc/terms/hasPart> <http://id.loc.gov/authorities/sh85078149#concept> .
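One rough way to see which triples point outside the local dataset is to compare the host of each object URI with the local host. This is only a sketch: the four-element tuple layout and the `external_links` helper are assumptions made for the example.

```python
from urllib.parse import urlsplit

LOCAL_HOST = "data.lib.cam.ac.uk"
CONCEPT = ("http://data.lib.cam.ac.uk/id/entry/"
           "cambrdgedb_43e3fa1b4404410454c90d8022578852")

# (subject, predicate, object, object_is_uri); values copied from the slide.
rows = [
    (CONCEPT, "http://www.w3.org/2004/02/skos/core#prefLabel",
     "Lohars -- History", False),
    (CONCEPT, "http://purl.org/dc/terms/hasPart",
     "http://id.loc.gov/authorities/sh85078149#concept", True),
]


def external_links(rows, local_host=LOCAL_HOST):
    """Return object URIs whose host differs from the local host."""
    return [obj for _s, _p, obj, is_uri in rows
            if is_uri and urlsplit(obj).netloc != local_host]


print(external_links(rows))
```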

Live demo …

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346

Meanwhile …

BNB

British Museum

Library of Congress

BBC Nature

The Linking Open Data cloud diagram - http://richard.cyganiak.de/2007/10/lod/

What was COMET?

Cambridge Open Metadata

Cambridge University Library / CARET / OCLC

Funded by the JISC Infrastructure for Resource Discovery Project

February to July 2011

http://discovery.ac.uk

What did COMET do …

1. Experimentally convert as much of the Cambridge University Library catalogue as it could from Marc21 to RDF triples
2. Investigate IPR issues around Open License publishing and Marc21
3. Construct an RDF publishing platform to sit behind those URIs …
4. Release tools for others to do the same
5. Blog and documentation

Why?

Respond to academic / national demand for Open Data

Get our data to non-librarians!

Tax-payer value-for-money

CUL already provides public APIs

Gain in-house experience of RDF

Move library services forward

Why - IPR

Linked data works best with a permissive license

CC0 or Public Domain Data License

Non-commercial licenses not suitable

Conflict with record vendors

How – IPR

Examine contracts with major vendors

Decide on re-use conditions and contact them

Decode record ownership from Marc21 fields (Could not use Voyager SQL)

How – IPR

Where does a record come from?

Several places in Marc21 where this data could be held (015,035,038,994 …)

Logic and hierarchy for examination

Attempt at scripted analysis – list bib_ids by record vendor
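A minimal sketch of that kind of scripted analysis. The order in which fields are examined and the marker strings are assumptions made for illustration, not the logic COMET actually used; `guess_vendor` and `bib_ids_by_vendor` are hypothetical names.

```python
from collections import defaultdict

# Fields are examined in this order; the precedence is an assumption.
FIELD_ORDER = ["038", "994", "035", "015"]

# Hypothetical markers: a lowercase substring that hints at a supplier.
MARKERS = {"ocolc": "OCLC", "bnb": "BNB"}


def guess_vendor(record):
    """record maps a Marc21 tag to a list of field values (strings)."""
    for tag in FIELD_ORDER:
        for value in record.get(tag, []):
            lowered = value.lower()
            for marker, vendor in MARKERS.items():
                if marker in lowered:
                    return vendor
    return "unknown"


def bib_ids_by_vendor(records):
    """Group bib_ids by the vendor guessed for each record."""
    grouped = defaultdict(list)
    for bib_id, record in records.items():
        grouped[guess_vendor(record)].append(bib_id)
    return dict(grouped)


sample = {
    1: {"035": ["(OCoLC)123"]},
    2: {"015": ["GBA1 bnb"]},
    3: {},
}
print(bib_ids_by_vendor(sample))
```

Records that match no marker land under "unknown", which is the group that would need checking by hand.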

What - IPR

Most vendors happy with permissive license for ‘non-marc21’ formats

RLUK / BL B.N.B. – PDDL

OCLC – ODC-By Attribution license

No good reason not to re-publish – need the right license!

IPR - What did we learn?

Marc21 not fit for purpose here, no ‘authoritative code’ for license

National / international mandate to release open data

No good reason not to re-publish – need the right license!

How - data

Several attempts – settled on SQL extracts based on lists of bib_ids

Use Perl scripting to ‘munge’ the data

You can try this at home ! (work)

How - marc problems

Punctuation as a function

Binary encoding

Numbers for field names

Bad characters

Replication of data in fields
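"Punctuation as a function" shows up in the 245 example earlier: subfield $a ends in " :" and $b ends in " /", and both marks are separators rather than part of the title. A small sketch of cleaning that up, assuming only a handful of trailing separators matter; `strip_isbd` and `join_title` are illustrative names. Trailing full stops are left alone because they may belong to an abbreviation.

```python
import re

# Trailing separator marks (":", "/", ";", ",", "=") plus surrounding spaces.
_TRAILING = re.compile(r"\s*[:/;,=]\s*$")


def strip_isbd(value):
    """Remove one trailing separator mark from a subfield value."""
    return _TRAILING.sub("", value).strip()


def join_title(subfield_a, subfield_b=""):
    """Rebuild a display title from cleaned $a and $b values."""
    parts = [strip_isbd(subfield_a)]
    if subfield_b:
        parts.append(strip_isbd(subfield_b))
    return " : ".join(parts)


print(join_title(
    "Early medieval history of Kashmir :",
    "[with special reference to the Loharas] A.D. 1003-1171 /",
))
```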

How – data vocab

RDF allows you to freely mix vocabularies

Emerging consensus on bibliographic description

Our conversion script is CSV customisable

BL and others leading the way
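A sketch of what a CSV-driven mapping could look like. The column layout (tag, subfield, predicate) and the two sample rows are assumptions for illustration; the real COMET mapping file may be organised differently.

```python
import csv
import io

# Hypothetical mapping file: which Marc21 tag/subfield feeds which predicate.
MAPPING_CSV = """tag,subfield,predicate
245,a,http://purl.org/dc/terms/title
260,c,http://purl.org/dc/terms/issued
"""


def load_mapping(csv_text):
    """Read the mapping into a dict keyed by (tag, subfield)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["tag"], row["subfield"]): row["predicate"] for row in reader}


def map_record(subject, fields, mapping):
    """fields is a list of (tag, subfield, value); unmapped fields are skipped."""
    lines = []
    for tag, subfield, value in fields:
        predicate = mapping.get((tag, subfield))
        if predicate is not None:
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append('<%s> <%s> "%s" .' % (subject, predicate, literal))
    return lines


mapping = load_mapping(MAPPING_CSV)
for line in map_record("http://example.org/id/1",
                       [("245", "a", "A title"), ("260", "c", "1981")],
                       mapping):
    print(line)
```

Changing vocabulary then means editing a row in the CSV rather than touching the script.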

How - data publishing

Bulk downloads

Queryable ‘endpoints’

Data and code at http://data.lib.cam.ac.uk

How – linking

PHP script to match text against LOC subject headings – enrich with LOC GUID

FAST / VIAF enrichment courtesy of OCLC
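The original matching script was written in PHP; the following is a Python sketch of the general idea only. The lookup table has a single entry that pairs the label "Lohars" with the id.loc.gov URI from the External linking slide, and that pairing is assumed for illustration. `normalise` and `match_heading` are hypothetical names.

```python
import re

# Hypothetical lookup of normalised heading text to a LOC identifier.
LCSH = {"lohars": "http://id.loc.gov/authorities/sh85078149#concept"}


def normalise(heading):
    """Collapse whitespace, drop a trailing full stop, lowercase."""
    return re.sub(r"\s+", " ", heading).strip().rstrip(".").lower()


def match_heading(heading, lookup):
    """Try the whole heading first, then the part before the first ' -- '."""
    first_part = heading.split("--")[0]
    for candidate in (heading, first_part):
        uri = lookup.get(normalise(candidate))
        if uri:
            return uri
    return None


print(match_heading("Lohars -- History", LCSH))
```

Falling back to the first part of a subdivided heading gives a partial match, which fits the hasPart relation used in the example triples.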

Data - What did we learn?

Marc / AACR2 cannot translate well to semantically rich formats

Need better container / transfer standards (not necessarily RDF)

What else?

RDF friendly database

Called RDF stores, triplestores or quadstores

Vary in size, scale and scope

None are particularly admin / dev friendly right now …

How - SPARQL

Query language for RDF stores

Still a work in progress

Some similarities with SQL

Bibliographic-centric tutorial
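To give a flavour of the SQL-like syntax, here is a sketch of a SPARQL query for the title and issue date of the example record, sent as a `query` parameter on an HTTP GET. The endpoint URL is a placeholder, not the real COMET endpoint, and `query_url` is an illustrative helper.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real endpoint

QUERY = """
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?title ?issued WHERE {
  <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346>
      dct:title ?title ;
      dct:issued ?issued .
}
"""


def query_url(endpoint, query):
    """Build a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


print(query_url(ENDPOINT, QUERY))
```

SELECT and WHERE will look familiar from SQL; the body of the WHERE clause is a triple pattern with variables in place of the unknown objects.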

How – storage and access

ARC2 - Lightweight MySQL / PHP solution

Good fit for a six month project

Great for around 3-500 k records

Not so good for 1 million plus

20 million + ?

Supporting tech - What did we learn?

Triplestores are cumbersome

SPARQL alone does not do the trick

High entry barrier to RDF is partly a result of these accompanying technologies

What does this mean for Ex Libris

Building whole systems around RDF is not really a good idea

Need the flexibility to do this by dropping Marc21

GUIDS for records (or allow us to have our own) – resolvable ?

Ensure any RDF publishing capacity is flexible (as ours is)

RDF capability for Primo ?

Always add value to RDF …

Standalone RDF is just fiddly Dublin Core, so …

Create HTTP URIs for entities

Link it to something useful (LOC, FAST, VIAF)

Endpoint (SPARQL?)

Don’t limit to the bibliographic
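The entity URIs in the earlier examples end in 32 hexadecimal characters, which looks like a hash of some key, though the slides do not say how they were generated. A sketch of one way to mint stable entity URIs, assuming an MD5 digest of the normalised label; `mint_entity_uri` is a hypothetical helper and will not reproduce the actual COMET identifiers.

```python
import hashlib

# Base copied from the example entity URIs.
ENTITY_BASE = "http://data.lib.cam.ac.uk/id/entity/cambrdgedb_"


def mint_entity_uri(label):
    """Derive a stable URI from a label (assumed scheme: MD5 of the label)."""
    # Collapse whitespace and lowercase so trivial variants share one URI.
    key = " ".join(label.split()).lower().encode("utf-8")
    return ENTITY_BASE + hashlib.md5(key).hexdigest()


print(mint_entity_uri("Mohan, Krishna"))
```

The same label always yields the same URI, so repeated conversions of the catalogue do not break existing links.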

Beyond bibliographic

Bibliographic

Holdings

FAST subject headings

Libraries

Transactions

Special collections

Archives

Creator / entity

Place of publication

LCSH subject headings

Course lists

Language

Librarians

Do what Tim said …

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things

http://www.w3.org/DesignIssues/LinkedData.html

Questions?

@edchamberlain / [email protected]

http://data.lib.cam.ac.uk

http://cul-comet.blogspot.com/
