
Supporting Persistent Citation

Webcast 4pm 11 December 2006

John Kunze, California Digital Library

California Digital Library

• A university library with no books, students, or faculty

• Central services for 10 campuses
  – 208,000 students
  – 121,000 faculty and staff
  – 100+ libraries and museums

• Content hosting: electronic texts, web-based material, finding aids, scanned book content (OCA, Google), datasets

What’s digital preservation?

Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions

Truthiness defined by the Designated Community

Kinds of loss

• Hard loss - some or all data bits are missing
• Soft loss - we think we have all the bits …
• Syntactic loss - bits are there, but format cannot be rendered by software
• Semantic loss - data renderable without apparent error, but not understandably
• Legal loss - data format or data itself is legally encumbered

Digital preservation in two parts

• Object: safeguarding its…
  – Viability (intact bit streams)
  – Renderability (by machines)
  – Understandability (by humans)

• Citation: no preservation if we don’t know…
  – What the object is, i.e., a summary description
  – What flavor of support (assuming non-accidental preservation) is intended by the provider
  – How to get it, or its persistent actionable identifier

Our Stuff vs Their Stuff

• While safeguarding objects is primary, focus now on tools for persistent reference: citations and identifiers

• Persistent reference can be split into
  – the Our Stuff Problem
  – the Their Stuff Problem

• No sense assigning persistent ids to Their Stuff
  – While Their Stuff is hugely important to Our users, we don’t control it and we cannot vouch for it
  – Where affordable, we might track Their Stuff (e.g., PURLs)

• New focus: persistent reference to Our Stuff

Citations as object surrogates

Surrogates provide a time-honored way of avoiding the inconvenience of directly handling objects.
– Surrogates are usually much smaller (e.g., a catalog card)
– Unlike the objects, surrogates may be unencumbered and in a language that you understand
– Surrogates are much more uniform (for easier processing) than objects
– Every system has surrogates, even if dynamically generated
– A surrogate is essentially an object citation

Reminder: What is a citation for?
– A citation is a surrogate-based tool to help us find, use, and manage information objects, resources, or stuff.

Citation metadata

Metadata definition 1: “data about data”
– Too broad and too narrow, e.g., a book review? a catalog record for a statue?

Metadata definition 2: “structured data about stuff”
– “stuff” avoids having to say a statue is data
– “structured” data assists automation by making it easy to recognize and record individual data elements
– The more uniform, the more leverage for interoperation
– Automation + Interoperation ⇒ Protocol

Citation metadata is structured data that is usually, but not always, secondary to, smaller than, and about stuff that is primary

Why citations?

The identifier isn’t enough
• Higher confidence through limited redundancy

The object is too much
• Often inconvenient to handle objects directly
• Need easy handling of diverse objects through uniform surrogates

Activating citations with protocols: simplicity / functionality pendulum

In the beginning, … application protocols were layered on TCP/IP
– Simplicity: email set the standard (RFC 822 headers)
– HTTP, NNTP, gopher, etc. followed its lead; OSI protocols withered

Then: second system syndrome (expanding functionality):
– Z39.50, CORBA, SOAP, and others

Regret period (contracting complexity):
– OpenSearch, RSS, and in DL world, SRW/SRU, OAI

How are we doing?
– Tues 13 June 2006: “low barrier” OAI failures attributed to errors in XML coding, schemas; poor, inconsistent, and expensive metadata; with surrogates too non-uniform to be of much use [CL & CL]

Perhaps things are still too complicated?

Simple citation metadata isn’t

Dublin Core metadata tried to be simple
– Goal: “specification shouldn’t register on a bathroom scale”

Goal achieved, but spec. was under-specified
– No definition of record
– No concept of minimal object description
– No layout rules for author names and dates
– No practical extension framework
– No meta-metadata, e.g., provenance, commitment statements

15 Dublin Core elements

Thought to apply to almost any object – discovery as goal

Content        Intellectual Property   Instantiation
Title          Creator                 Date
Subject        Publisher               Format
Description    Contributor             Identifier
Type           Rights                  Language
Source
Relation
Coverage

Despite DCMI efforts to correct known problems, the simplest protocol with the simplest metadata – OAI – reports an overall 36% failure rate, 77% due to metadata/encoding and protocol errors.

“Simple” Dublin Core metadata

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN"
  "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
    <dc:title>The Digital Dilemma</dc:title>
    <dc:creator>National Research Council</dc:creator>
    <dc:date>2000-06-22</dc:date>
  </rdf:Description>
</rdf:RDF>

Same record with Dublin “Kernel”

Here’s the same information, still machine-readable, as an Electronic Resource Citation (ERC) with Kernel metadata:

erc:
who: National Research Council
what: The Digital Dilemma
when: 2000
where: http://books.nap.edu/html/digital%5Fdilemma

Motivators for the ERC
– Meet the need for a simple and manipulable record
– Direct human contact with metadata is inevitable
– Record should place minimal strain on people
– Succinct, transparent, trivially parseable (2 lines of Perl code)

Making it minimal: Kernel/ERC

Electronic Resource Citation (ERC): back to basics
• An ERC record is a sequence of elements in email header format:
  ⇒ label, colon, value
• Long values are continued on indented lines
• A blank line ends a record

Based on cross-domain kernel distilled from Dublin Core
• who - a responsible person or party
• what - a name or other human-oriented identifier
• when - a date important in the object’s lifecycle
• where - a location or a machine-oriented identifier
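The record format described above (label, colon, value; indented continuation lines; a blank line ends a record) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference parser: the function name `parse_erc` is hypothetical, and joining a continuation line to its value with a single space is an assumption, not something the slides specify.

```python
def parse_erc(text):
    """Parse ERC-style text into a list of records.

    Each record is a list of (label, value) pairs in document order.
    """
    records, current, last = [], [], None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last = [], None
        elif line[0] in " \t" and last is not None:
            # An indented line continues the previous element's value
            # (joined with one space: an assumption for this sketch).
            label, value = current[last]
            current[last] = (label, (value + " " + line.strip()).strip())
        else:
            # Split on the first colon only, so URLs in values survive.
            label, _, value = line.partition(":")
            current.append((label.strip(), value.strip()))
            last = len(current) - 1
    if current:
        records.append(current)
    return records
```

For example, `parse_erc("erc:\nwho: National Research Council\n")` yields one record whose second element is `("who", "National Research Council")`.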

The ERC notion of “story”

The same record as before, in its most compact form:

erc: National Research Council | The Digital Dilemma | 2000 | http://books.nap.edu/html/digital%5Fdilemma

Either ERC form starts by telling the story of an expression of the resource, applying who-what-when-where questions to it.
– All 4 kernel elements are required
– Absent values must be explained; 7 flavors of “empty”
– Element ordering is rigid in compact form (positional semantics)
– Arbitrary additional elements may occur after the 4 elements

Other segments in the ERC may introduce other stories, such as,
– erc-about, erc-support, erc-from
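The positional semantics of the compact form can be illustrated by expanding it back into labeled elements. This is only a sketch: the name `expand_compact` is hypothetical, it assumes that "|" never occurs inside a value, and it simply returns any values past the fourth as an unlabeled list because the slides do not say how extra elements are written in compact form.

```python
KERNEL = ("who", "what", "when", "where")

def expand_compact(line):
    """Expand 'erc: who | what | when | where' into (story, elements, extras)."""
    story, _, rest = line.partition(":")       # split on the first colon only
    values = [v.strip() for v in rest.split("|")]
    if len(values) < len(KERNEL):
        raise ValueError("all four kernel elements are required")
    elements = dict(zip(KERNEL, values))        # position decides the label
    extras = values[len(KERNEL):]               # anything after the kernel
    return story.strip(), elements, extras
```

Applied to the compact record on this slide, the function returns the story name `erc` and a dictionary whose `where` value is the full URL, since only the first colon on the line is treated as the label separator.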

A 2-story ERC record

erc:
who: Tomlinson, Richard
what: Adjustable knock down chair
when: (:unkn)
where: http://espacenet.com/dips/bnsviewer%{
       ? CY=ec & LG=en & DB=EPD & PN=US5498054
       & ID=US+++5498054A1+I+ %}
erc-support:
who: European Patent Office
what: (:permuc) Permanent, Unchanging Content
# Note to ops staff: verify date.
when: 20010621
where: http://ark.espacenet.com/ark:/23003/US5498054

Mapping ERC to Dublin Core

Story       Kernel Element   Equivalent DC Element
erc         who              Creator/Contributor/Publisher
erc         what             Title
erc         when             Date
erc         where            Identifier
erc-about   who              None
erc-about   what             Subject
erc-about   when             Coverage (temporal)
erc-about   where            Coverage (spatial)

ERC special values

Controlled element values have the form, “(:ccode)”
– e.g., missing: (:unkn) Anonymous, (:unas) Unassigned
– e.g., general: (:791) Bee Stings

Sort-friendly values keyed off of initial comma
who:, van Gogh, Vincent
who:, Howell, III, PhD, 1922-1987, Thurston
who:, Mao Tse Tung
what:, Health and Human Services, United States Government Department of, The,

and their equivalents in natural word order:
Vincent van Gogh
Thurston Howell, III, PhD, 1922-1987
Mao Tse Tung
The United States Government Department of Health and Human Services
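The four examples above suggest a simple rule for recovering natural word order, sketched here in Python. The rule is inferred from these examples alone and is an assumption, not a statement of the ERC specification: after the initial comma, the last comma-separated piece moves to the front; if the value also ends with a comma, every piece is an inversion point and the pieces are reversed. The function name `natural_order` is hypothetical.

```python
def natural_order(value):
    """Turn a sort-friendly value (initial comma) into natural word order."""
    value = value.strip()
    if not value.startswith(","):
        return value                      # not marked as sort-friendly
    body = value[1:].strip()
    if body.endswith(","):
        # Trailing comma: treat every comma as an inversion point.
        pieces = [p.strip() for p in body[:-1].split(",")]
        return " ".join(reversed(pieces))
    head, sep, tail = body.rpartition(",")
    if not sep:
        return body                       # nothing to invert
    # Otherwise only the final piece moves to the front.
    return tail.strip() + " " + head.strip()
```

With this rule, `natural_order(", van Gogh, Vincent")` gives `Vincent van Gogh`, and the remaining three examples on the slide come out as listed.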

ERC dates and expansion blocks

ERC value with an “expansion” block, delimited by “%{” and “%}”

where: http://foo.bar.org/node%{
         ? db = foo & start = 1 & end = 5 & buf = 2
         & query = foo + bar + zaf
       %}

is equivalent to the correct and intact URL,

where: http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf

Dates are in TEMPER format
1996-2000 (range of four years)
1952, 1957, 1969 (list of three years)
1952, 1958-1967, 1985 (mixed list of dates & ranges)
20001229-20001231 (range of three days)
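Going by the expansion-block example on this slide, collapsing a block amounts to deleting the “%{” and “%}” markers together with all whitespace between them. The following is a minimal sketch under that reading; the function name `collapse_expansion` is hypothetical.

```python
import re

def collapse_expansion(value):
    """Remove %{ ... %} markers and any whitespace inside them."""
    def squeeze(match):
        # Drop spaces, tabs, and newlines found inside the block.
        return re.sub(r"\s+", "", match.group(1))
    # re.S lets a block span several (continued) lines.
    return re.sub(r"%\{(.*?)%\}", squeeze, value, flags=re.S)
```

Text outside the markers is left untouched, so a value may mix literal portions with one or more expansion blocks.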

Kernel/ERC summary

ERC is a cheap, general-purpose citation container
• Its kernel metadata is designed to be a low-barrier way to support orderly management of collections
• Might help resource discovery and description too
• Succinct, trivial to parse, extensible yet predictable in the kernel elements

See http://dublincore.org/groups/kernel/ for more

How to activate an ERC? One way is with THUMP.

Searching and retrieving citations with THUMP

THUMP: The HTTP URL Mapping Protocol
• A set of simple URL-based conventions for retrieving information and conducting searches
• Can be used for focused retrievals or for broad database searches
• Based on commands put in the query string after ‘?’

http://example.foo.com/?in(books)find(war and peace)show(full)

THUMP requests

The HTTP URL Mapping Protocol (THUMP)
– A protocol based on HTTP and URLs
– A request is passed to a server with HTTP GET (or POST)

Shortest request is a URL ending in ‘?’, as in
  http://example.foo.com/object321?
which is shorthand for the common request:
  http://example.foo.com/object321?show(brief)as(anvl/erc)

Naked ‘?’ and ‘??’ are designed to support the known-item query convention arising in the ARK persistent id scheme
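To make the request syntax concrete, here is a small Python sketch that assembles THUMP-style URLs from an ordered list of commands and expands the naked ‘?’ shorthand. The helper names `thump_url` and `expand_shorthand` are hypothetical, and the sketch leaves argument text unescaped, as in the slide examples; whether THUMP requires escaping is not stated here.

```python
def thump_url(base, commands=None):
    """Build a request URL from (command, argument) pairs, in order."""
    if not commands:
        return base + "?"                 # shortest request: a naked '?'
    query = "".join(f"{name}({arg})" for name, arg in commands)
    return base + "?" + query

def expand_shorthand(url):
    """Rewrite a URL ending in a single '?' to the request it stands for."""
    if url.endswith("?") and not url.endswith("??"):
        return url + "show(brief)as(anvl/erc)"
    return url
```

For instance, `thump_url("http://example.foo.com/", [("in", "books"), ("find", "war and peace"), ("show", "full")])` reproduces the search URL shown on the earlier slide.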

THUMP responses

Responses consist of HTTP response headers, one record set header, and one or more ERC records

C: [opens session]
C: GET http://ark.cdlib.org/ark:/13030/ft167nb0vq? HTTP/1.1
C:
S: HTTP/1.1 200 OK
S: Content-Type: text/plain
S: THUMP-Status: 0.5 200 OK
S:
S: set-start: California Digital Library | THUMP 0.5 | 20060606161407
S:    | http://ark.cdlib.org/ark:/13030/ft167nb0vq?
S:    | http://dublincore.org/groups/kernel/erc
S: here: 1 | 1 | 1
S:
S: erc:
S: who: Stanton A. Glantz and Edith D. Balbach
S: what: Tobacco War: Inside the California Battles
S: when: 20000510
S: where: http://ark.cdlib.org/ark:/13030/ft167nb0vq
S: [closes session]

Broad searching in THUMP

General form of broad query

Key ? in(DB) find(QUERY) list(RANGE) show(ELEMS) as(FORMAT)

Many details to be worked out; watch for http://www.ietf.org/internet-drafts/draft-kunze-thump-01.txt

Identifiers and citations

Persistent actionable identifiers should…
• Lead to citation
• Lead to flavors of permanence
• Lead to access (if authorized)
  – Not strictly an “identification” problem, but this is the “404 not found” that we need to fix
• Be valid for some longish period
• Be carried on, in, or with the object

Choosing an id scheme

The most perfect identifier scheme will be useless in the face of service failure
– All identifier persistence is purely about service
– Identifier confidence depends on service access

Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, …
– Which of these builds in service access points?
– Which may actually heighten your risk without lowering your costs?

Impact of identifier scheme choice on causes of identifier breakage

Cause                                                            URN   URL   PURL  Handle  DOI   ARK
Provider removed or replaced the object                          None  None  None  None    None  None
Provider moved object, didn’t update forwarding/redirect table   None  None  None  None    None  None
War, social upheaval, natural disaster                           None  None  None  None    None  None
Loss of political support, no successor found                    None  None  None  None    None  None
Bankruptcy or funding loss, no successor found                   None  None  None  None    None  None

Costs of identifier scheme choice

Cost                                                             URN    URL   PURL   Handle  DOI    ARK
Requires maintaining a two-column forwarding table               Yes    Yes   Yes    Yes     Yes    Yes
Needs complex, special-purpose global infrastructure or plug-in  Yes    No    Sorta  Yes     Yes    No
Requires special intervention beyond the reign of HTTP           Sorta  Yes   Yes    Sorta   Sorta  No
Needs web server, web browser, and DNS, or future equivalents    Yes    Yes   Yes    Yes     Yes    Yes

Benefits of identifier scheme choice

Benefit                                                                URN    URL  PURL   Handle  DOI/Crossref  ARK/N2T
Separation of Name Assigning and Mapping Authorities                   Sorta  No   No     No      No            Yes
Protection for small institutions from hostname instability            Yes    No   Sorta  Yes     Yes           No/yes
API to object description                                              No     No   No     No      No/yes        Yes
Protection from hyphenation and namespace splitting                    No     No   No     No      No            Yes/yes
API to object persistence policy statement                             No     No   No     No      No            Yes
Global lexical inference of logical sub-objects and object variants    No     No   No     No      No            Yes

ARK identifiers at a glance

An ARK (Archival Resource Key) is a URL with extra structure and conventions. A sample ARK for a digital object:

  http://cdlib.org/ark:/12025/654xz321
  \______________/ \__/ \___/ \______/
    (replaceable)   |     |      |
         |      ARK Label |      |
         |                |      3 Name (NAA-assigned)
         |                2 Name Assigning Authority Number (NAAN)
         1 Name Mapping Authority

1 = current service provider; identity inert “booster rocket”
2 = organization that originally assigned the id
3 = name originally assigned to the abstract object, often opaque
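The anatomy above can be turned into a small parser. The Python sketch below is illustrative only: the names `Ark`, `parse_ark`, and `same_object` are hypothetical, and the regular expression is a simplification that does not try to cover every case in the ARK specification. It shows why the Name Mapping Authority is "replaceable": identity rests on the NAAN and the Name alone.

```python
import re
from typing import NamedTuple

class Ark(NamedTuple):
    nma: str    # 1: Name Mapping Authority (current service provider)
    naan: str   # 2: Name Assigning Authority Number
    name: str   # 3: Name assigned by the NAA

# Optional "http(s)://host/" prefix, then the "ark:" label, NAAN, and Name.
_ARK_RE = re.compile(r"^(?:https?://([^/]+)/)?ark:/([^/]+)/(.+)$")

def parse_ark(text):
    match = _ARK_RE.match(text)
    if match is None:
        raise ValueError(f"not an ARK: {text!r}")
    return Ark(match.group(1) or "", match.group(2), match.group(3))

def same_object(a, b):
    """Two ARKs name the same thing if NAAN and Name agree; the NMA is ignored."""
    x, y = parse_ark(a), parse_ark(b)
    return (x.naan, x.name) == (y.naan, y.name)
```

Parsing the sample above gives `Ark(nma="cdlib.org", naan="12025", name="654xz321")`.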

ARK usage

Two ARKs accessing the same thing
  http://loc.gov/ark:/12025/654xz321
  http://rutgers.edu/ark:/12025/654xz321

Access to metadata -- add a ‘?’
  http://loc.gov/ark:/12025/654xz321?

Access to support statement -- add ‘??’
  http://loc.gov/ark:/12025/654xz321??

Three minimal requirements to be an ARK
– An archive that can’t do all 3 -- trustworthy?
– Is an ARK persistent? Maybe. Have to ask.

Transparency vs opaqueness

• Do ARKs have to be this ugly (opaque)?

  http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff
  \___________________/ \__/ \___/ \______/ \____________/
          NMAH         Label NAAN    Name     Qualifier

• No, but they encourage it. Persistence is all about managing associations between strings and things
  – And the landscape is littered with links that were required to die for political, legal, or social reasons
  – the appearance, deliberate or even accidental, of once-true assertions that are now misleading, infringing, offensive makes it hard for our descendants to continue managing

• Pain of managing opaque ids mitigated by strong metadata
  – But opacity makes for ids that age and travel well
  – Hybrid: opaque ids name abstract preservation objects, and semantic/transient extensions address sub-objects

ARK lexical goodies

Hyphens ignored
– Neutralizes harm done by typesetters

Too many search results? Providers may disclose (or not)…
– Sub-object hierarchy using reserved ‘/’
– Variant objects using reserved ‘.’
– Usual %hh (hex encoding) as an escape
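The "hyphens ignored" rule can be sketched as a normalization step applied before two ARKs are compared. This is a hedged illustration: the function name `strip_hyphens` is hypothetical, and limiting the rule to the text after the `ark:` label (so that hostnames keep their hyphens) is an assumption of this sketch.

```python
def strip_hyphens(ark):
    """Remove hyphens from the part of an ARK that follows the 'ark:' label."""
    head, label, tail = ark.partition("ark:")
    if not label:
        return ark                        # no ARK label, leave untouched
    return head + label + tail.replace("-", "")
```

An ARK that a typesetter broke as `ark:/12025/654-xz321` then compares equal to `ark:/12025/654xz321`.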

Revealing hierarchy in ARKs

Sub-object hierarchy using reserved x/y
• Meaning x is containing object for y, for some defined containment relationship
• Works for chapters, sections, pages, etc.
• Persistence of that relationship?

Saying ark:/12025/654/xz/321 implies also
– ark:/12025/654/xz
– ark:/12025/654
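The implication above is purely lexical, so it can be computed from the string alone. A minimal Python sketch, assuming the input is a bare `ark:/NAAN/Name` string with no hostname in front; the function name `implied_containers` is hypothetical.

```python
def implied_containers(ark):
    """List the containing objects implied by '/' in an ARK name, nearest first."""
    label, naan, name = ark.split("/", 2)   # "ark:", NAAN, and the Name
    parts = name.split("/")
    return [
        f"{label}/{naan}/" + "/".join(parts[:count])
        for count in range(len(parts) - 1, 0, -1)
    ]
```

For `ark:/12025/654/xz/321` this returns the two ARKs listed on the slide, in the same order.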

Revealing variants in ARKs

Variant objects using reserved x.y
• Meaning x is an object basename with variant y, for some defined variant relationship
• Works for formats, versions, languages, etc.
• Persistence of that relationship?

Saying ark:/12025/654.20v.78g implies also
– ark:/12025/654.20v
– ark:/12025/654
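Variant inference works the same way as hierarchy inference, but on the reserved ‘.’ character. The Python sketch below assumes that variant suffixes appear only in the final ‘/’-separated component of the ARK; that restriction and the function name `implied_basenames` are assumptions made for illustration.

```python
def implied_basenames(ark):
    """List the basenames implied by '.' variants in an ARK, nearest first."""
    head, _, last = ark.rpartition("/")     # variants live in the last component
    parts = last.split(".")
    return [
        head + "/" + ".".join(parts[:count])
        for count in range(len(parts) - 1, 0, -1)
    ]
```

For `ark:/12025/654.20v.78g` this returns `ark:/12025/654.20v` followed by `ark:/12025/654`.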

ARK namespaces reserved

12025 National Library of Medicine
12026 Library of Congress
12027 National Agriculture Library
13030 California Digital Library
13960 Internet Archive
13038 World Intellectual Property Organization
20775 University of California San Diego
29114 University of California San Francisco
28722 University of California Berkeley
15230 Rutgers University Libraries
64269 UK Digital Curation Centre
62624 New York University Libraries
67531 University of North Texas Libraries
27927 Portico/Ithaka Electronic-Archiving Initiative
12148 National Library of France
78319 Google Book Search

Reserve a namespace by email to [email protected]

Opaque identifier tools

Non-opaque identifier strings are chosen deliberately to assert some things that are true at the time of assignment

Opaque identifier strings are best chosen by automated means, such as
• NOID (nice opaque identifier)
• Or UUID/GUID (universally unique identifier)
  – Sequence of hex encodings of your computer’s MAC address, current time, and sometimes a random number
  – No need to ask permission or register yourself, but based on IEEE and hardware vendor registries
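As a quick illustration of the UUID option, Python's standard `uuid` module can mint both the time-and-hardware-address kind described above (version 1) and a purely random kind (version 4). Note that `uuid.uuid1()` falls back to a random node value when no hardware address can be found, so the MAC-address property is not guaranteed on every machine.

```python
import uuid

# Version 1: built from the current time, a clock sequence, and the
# host's hardware (MAC) address when one is available.
time_based = uuid.uuid1()

# Version 4: built from random bits only; no registry is involved.
random_based = uuid.uuid4()

# The hex form groups 32 hex digits into 5 hyphen-separated fields.
print(time_based)
print(random_based)
```

Either form is opaque in the sense used here: the string asserts nothing about the object it names.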

Nice opaque identifiers (NOID)

• A noid minter is a lightweight database for generating, tracking, and binding unique ids

• The noid tool creates minters and accepts commands that operate them
  – Open source, available at www.cpan.org

• Can mint in random or sequential order, with or without a check character guaranteeing against the most common transcription errors

• Anyone can run a noid minter, maintain associations via bindings to arbitrary elements (assertions), and set up a resolver (including rule-based)

Using NOID

• Identifiers minted according to a template:
    noid dbcreate f5.reedeedk long 13030
  which produces as first minted id
    13030/f54x54g11

• Noid is scheme-independent
  – Can be used to mint DOIs, URNs, URLs, lotto numbers, etc.
  – We (at CDL) use it to mint sequential transaction ids and randomized ARKs with check chars
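The check character at the end of a minted id can be recomputed independently of the noid tool. The sketch below follows the check-character scheme as I understand it from the NOID documentation: each character's ordinal in a 29-character "extended digit" alphabet is multiplied by its 1-based position, characters outside the alphabet (such as "/") count as zero, and the sum modulo 29 selects the check character. The function names are hypothetical, and the description should be checked against the NOID documentation before relying on it.

```python
# Digits plus consonants (no vowels, no 'l'): 29 symbols.
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"

def noid_check_char(text):
    """Compute the check character for an id given without its final character."""
    total = 0
    for position, char in enumerate(text, start=1):
        ordinal = XDIGITS.index(char) if char in XDIGITS else 0
        total += position * ordinal
    return XDIGITS[total % len(XDIGITS)]

def has_valid_check_char(identifier):
    """True if the last character matches the check character of the rest."""
    return noid_check_char(identifier[:-1]) == identifier[-1]
```

Applied to the id minted on this slide, `noid_check_char("13030/f54x54g1")` gives `"1"`, matching the final character of `13030/f54x54g11`; swapping two adjacent digits, as in `13030/f54x45g11`, makes the check fail.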

Documentation

• ARK specification http://www.ietf.org/internet-drafts/draft-kunze-ark-11.txt

• NOID http://www.cdlib.org/inside/diglib/ark/noid.pdf

Persistent citation next steps

Thinking through how to express when you’re citing a snapshot vs a stage that enacts a thematically consistent thing

Persistent citation for small organizations
– Their objects are served from at-risk hostnames and servers
– They need a Name-to-Thing resolver

N2T “Entity” – persistent identifiers for smaller organizations

Establish a consortium and a small web server.

Each member publishes URLs under n2t.info:

http://n2t.info/12345/foo/bar.zaf

…which redirects to the member server URL.

Why? It solves the same persistent identifier problem as URN, DOI, and Handle systems, but more fully, and at lower cost and risk.

Persistent identification

Persistent identifiers? We have them.
• But still need persistent actionable ids
Actionable “with widely available tools”
• Which really means “with URLs”
URNs, ARKs, Handles, DOIs, etc. become actionable (practically speaking) when embedded in URLs
• All these ids have similar maintenance costs, and they all break for the usual reasons

The usual reasons

Whatever the string, what matters is the thing
• If the thing’s unavailable, the id’s broken
Broken, for URLs, means either
• The hostname is broken
  – Server down, gone, or renamed *
  – Domain name lost, provider out of business
• Or the pathname to the thing is broken
  – Thing down, gone, or renamed *

* No global fix for these, only the provider can fix.

Hostname instability

Domain name lost, provider out of business
• We can help this case
Smaller organizations most vulnerable
• The comfort of not seeing your hostname
• The comfort of seeing your hostname
• Traditional solutions: PURL, URN, Handle, DOI
• Solutions tied to special-purpose technology, sometimes complex and proprietary

N2T (Name-to-Thing)

N2T is two things at once
• A consortium of cultural memory organizations … and …
• A small, ordinary web server, mirrored in several instances globally for reliability

Basic idea: protect 200 organizations’ URLs from hostname instability with 200 rewrite rules

How: simple HTTP redirects, one per organization
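The "one rewrite rule per organization" idea can be sketched as a lookup table keyed by member number. Everything concrete below is hypothetical: the table contents, the member hostname, and the function name `resolve` are placeholders, and a real resolver would answer with an HTTP redirect rather than return a string. The sketch also ignores any query string on the incoming URL.

```python
from urllib.parse import urlsplit

# One rule per member organization: member number -> current server.
# The hostname here is a made-up placeholder.
MEMBER_RULES = {
    "12345": "http://www.member.example.org",
}

def resolve(n2t_url):
    """Map a resolver URL to the member's URL, or None if no rule matches."""
    path = urlsplit(n2t_url).path.lstrip("/")
    number, _, rest = path.partition("/")   # leading path segment picks the rule
    base = MEMBER_RULES.get(number)
    if base is None:
        return None
    return f"{base}/{rest}"
```

If a member changes hostnames, only its single table entry changes; every URL it published under the resolver keeps working.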

N2T – user point of view

Each consortium member organization gets a unique number, such as 12345.

N2T – system point of view

Technically, resolution (access to a thing given its name) is two simple steps.

N2T – consortium point of view

“Consortium-lite”
– Members have no fees or responsibilities
– One domain name for whole consortium

Rent is $30/year, runs on 4 total web servers

Volunteer member orgs run the servers
– 1 primary + 3 mirrors

Interested bodies: CENDI, DLF, DCC
Interested institutions: CDL, NYU, NLA, …

N2T – global point of view

Regional (e.g., Europe, Asia, North America) clusters of mirrored resolver instances, with round-robin failover for redundancy, fault-tolerance, and load-sharing

• No browser modifications required

[Figure: a user’s web browser connected to four resolver instances]

Namespace Splitting Problem

[Figure: a user’s web browser and a URN/Handle/DOI resolver reach numbered pages through URL, URL’, and URL’’, with the pages divided among organizations A, B, and C]

Org’n A’s namespace splits when B and C inherit its objects. Under the URN/Handle/DOI model, B must still forward to C.

Solution: add per-object rules

Over years: add per-object redirects as objects go their separate ways
• No server other than N2T has to keep forwarding tables
This requires periodic (e.g., daily) harvesting of local table mappings from well-known provider-side server files, e.g., Google sitemaps

[Figure: the user’s web browser asks the global resolver and is sent to/from the final web server; the resolver receives harvested updates from a university library, a national library, a national archive, ...]
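One way to picture per-object rules is as a second, more specific table that is consulted before the per-organization rule. The sketch below is an assumption-laden illustration, not the prototype's design: the table contents, hostnames, and the names `resolve` and `harvest` are hypothetical, and the harvesting step is reduced to merging a dictionary of mappings.

```python
# Coarse rule: one entry per organization (placeholder hostname).
ORG_RULES = {"12345": "http://a.example.org"}

# Fine-grained rules, filled in over the years from harvested mappings.
OBJECT_RULES = {}

def harvest(mappings):
    """Merge freshly harvested per-object mappings into the resolver's table."""
    OBJECT_RULES.update(mappings)

def resolve(path):
    """Resolve 'number/rest-of-name'; a per-object rule wins over the org rule."""
    if path in OBJECT_RULES:
        return OBJECT_RULES[path]
    number, _, rest = path.partition("/")
    base = ORG_RULES.get(number)
    return None if base is None else f"{base}/{rest}"
```

After `harvest({"12345/page7": "http://c.example.org/page7"})`, the object that moved to organization C resolves directly to its new home, while every other object under 12345 still follows the organization-level rule.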

Prototype resolver

Sample identifiers at n2t.info – these work now
  http://n2t.info/12345/libraries/visitor.html
  http://n2t.info/13030/inside
  http://n2t.info/urn:nbn:se:uu:diva-3324
  http://n2t.info/ark:/13030/tf5p30086k

Incidentally, it can also redirect all URNs, DOIs, and Handles, e.g.,
  http://n2t.info/doi:10.1111/j.0307-6946.2004.00571.x

Advantages of the n2t.info persistent identifier resolver

• Big reduction in architectural complexity
• No browser modification required
• Identifier scheme-agnostic
• No proprietary, special-purpose infrastructure to carry forward as a liability to persistence

Identifier myths

What’s an identifier as opposed to a mere string of data? Are these identifiers?

1L039G81U
Peter
http://www.cdlib.edu/staff/~greenstein

Identifiers and locations

We want identifiers that are location-independent. Which of these are location-independent?

http://dot.ucop.edu/home/jak
http://n2t.info/14998/xt8732r
http://n2t.info/14998/xt8732r?x=foo&y=bar

Dynamic and static

People complain about dynamically generated pages. Which of these are dynamically generated?

http://dot.ucop.edu/home/jak
http://n2t.info/14998/xt8732r
http://n2t.info/14998/xt8732r?x=foo&y=bar

Identifier basics

An identifier is an association between a string and a thing; you cannot tell by looking at it
– Strongest association: object wearing its identifier

In the real world, an identifier is also an opinion, and opinions will differ

Identifier persistence is purely a matter of service, not magically conferred by a scheme
– Identifier form never imparts anything meaningful for persistence concerning location or static-vs-dynamic

Identifiers break due to politics, poorly set user expectations, and low organization awareness

Identifier/citation next steps

Citing a changing object
• NLM permanence ratings
• Change history (back and forth)
• Cheap fixity
Congruence between databases and documents
• Hierarchy, complexity
Annotations are objects too
• Main distinguishing feature being the target object
Units of citation, delivery, addressing, recall
