Supporting Persistent Citation Webcast 4pm 11 December 2006 John Kunze, California Digital Library

Supporting Persistent Citation - GitHub Pagesjkunze.github.io/Pcite.pdf · Identifiers and citations Persistent actionable identifiers should… •Lead to citation •Lead to flavors


Supporting Persistent Citation

Webcast 4pm 11 December 2006

John Kunze, California Digital Library


California Digital Library

• A university library with no books, students, or faculty

• Central services for 10 campuses
  – 208,000 students
  – 121,000 faculty and staff
  – 100+ libraries and museums

• Content hosting: electronic texts, web-based material, finding aids, scanned book content (OCA, Google), datasets



What’s digital preservation?

Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions

Truthiness defined by the Designated Community


Kinds of loss

• Hard loss - some or all data bits are missing
• Soft loss - we think we have all the bits …
• Syntactic loss - bits are there, but format cannot be rendered by software
• Semantic loss - data renderable without apparent error, but not understandably
• Legal loss - data format or data itself is legally encumbered


Digital preservation in two parts

• Object: safeguarding its…
  – Viability (intact bit streams)
  – Renderability (by machines)
  – Understandability (by humans)

• Citation: no preservation if we don’t know…
  – What the object is, i.e., a summary description
  – What flavor of support (assuming non-accidental preservation) is intended by the provider
  – How to get it, or its persistent actionable identifier


Our Stuff vs Their Stuff

• While safeguarding objects is primary, focus now on tools for persistent reference: citations and identifiers

• Persistent reference can be split into
  – the Our Stuff Problem
  – the Their Stuff Problem

• No sense assigning persistent ids to Their Stuff
  – While Their Stuff is hugely important to Our users, we don’t control it and we cannot vouch for it
  – Where affordable, we might track Their Stuff (e.g., PURLs)

• New focus: persistent reference to Our Stuff


Citations as object surrogates

Surrogates provide a time-honored way of avoiding the inconvenience of directly handling objects.
– Surrogates are usually much smaller (e.g., a catalog card)
– Unlike the objects, surrogates may be unencumbered and in a language that you understand
– Surrogates are much more uniform (for easier processing) than objects
– Every system has surrogates, even if dynamically generated
– A surrogate is essentially an object citation

Reminder: What is a citation for?
– A citation is a surrogate-based tool to help us find, use, and manage information objects, resources, or stuff.


Citation metadata

Metadata definition 1: “data about data”
– Too broad and too narrow, e.g., a book review? a catalog record for a statue?

Metadata definition 2: “structured data about stuff”
– “stuff” avoids having to say a statue is data
– “structured” data assists automation by making it easy to recognize and record individual data elements
– The more uniform, the more leverage for interoperation
– Automation + Interoperation ⇒ Protocol

Citation metadata is structured data that is usually, but not always, secondary to, smaller than, and about stuff that is primary


Why citations?

The identifier isn’t enough
• Higher confidence through limited redundancy

The object is too much
• Often inconvenient to handle objects directly
• Need easy handling of diverse objects through uniform surrogates


Activating citations with protocols: simplicity / functionality pendulum

In the beginning, … application protocols were layered on TCP/IP
– Simplicity: email set the standard (RFC 822 headers)
– HTTP, NNTP, gopher, etc. followed its lead; OSI protocols withered

Then: second system syndrome (expanding functionality):
– Z39.50, CORBA, SOAP, and others

Regret period (contracting complexity):
– OpenSearch, RSS, and in DL world, SRW/SRU, OAI

How are we doing?
– Tues 13 June 2006: “low barrier” OAI failures attributed to errors in XML coding, schemas; poor, inconsistent, and expensive metadata; with surrogates too non-uniform to be of much use [CL & CL]

Perhaps things are still too complicated?


Simple citation metadata isn’t

Dublin Core metadata tried to be simple
– Goal: “specification shouldn’t register on a bathroom scale”

Goal achieved, but spec. was under-specified
– No definition of record
– No concept of minimal object description
– No layout rules for author names and dates
– No practical extension framework
– No meta-metadata, e.g., provenance, commitment statements


15 Dublin Core elements

Thought to apply to almost any object – discovery as goal

Content                Intellectual Property    Instantiation
Title                  Rights                   Language
Subject                Publisher                Identifier
Source                 Creator                  Format
Relation               Contributor              Date
Type
Description
Coverage

Despite DCMI efforts to correct known problems, the simplest protocol with the simplest metadata – OAI – reports an overall 36% failure rate, 77% due to metadata/encoding and protocol errors.


“Simple” Dublin Core metadata

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN"
    "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
    <dc:title>The Digital Dilemma</dc:title>
    <dc:creator>National Research Council</dc:creator>
    <dc:date>2000-06-22</dc:date>
  </rdf:Description>
</rdf:RDF>


Same record with Dublin “Kernel”

Here’s the same information, still machine-readable, as an Electronic Resource Citation (ERC) with Kernel metadata:

erc:
who: National Research Council
what: The Digital Dilemma
when: 2000
where: http://books.nap.edu/html/digital%5Fdilemma

Motivators for the ERC
– Meet the need for a simple and manipulable record
– Direct human contact with metadata is inevitable
– Record should place minimal strain on people
– Succinct, transparent, trivially parseable (2 lines of Perl code)


Making it minimal: Kernel/ERC

Electronic Resource Citation (ERC) - back to basics
• An ERC record is a sequence of elements in email header format: ⇒ label, colon, value
• Long values are continued on indented lines
• A blank line ends a record

Based on cross-domain kernel distilled from Dublin Core
• who - a responsible person or party
• what - a name or other human-oriented identifier
• when - a date important in the object’s lifecycle
• where - a location or a machine-oriented identifier
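The three record rules on this slide (label, colon, value; indented continuation lines; a blank line ends a record) can be sketched in a few lines of Python. This is a minimal illustration under those rules only, not the reference parser; the function name `parse_erc` and the choice to join continuation lines with a single space are assumptions made for the example.

```python
def parse_erc(text):
    """Parse ERC-style text into records, each a list of (label, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = []
        elif line[0] in " \t" and current:
            # An indented line continues the previous element's value
            # (joined with one space; an assumption of this sketch).
            label, value = current[-1]
            current[-1] = (label, (value + " " + line.strip()).strip())
        else:
            # label, colon, value: split on the first colon only,
            # so URLs in the value keep their own colons.
            label, _, value = line.partition(":")
            current.append((label.strip(), value.strip()))
    if current:
        records.append(current)
    return records


sample = """erc:
who: National Research Council
what: The Digital Dilemma
when: 2000
where: http://books.nap.edu/html/digital%5Fdilemma
"""
for label, value in parse_erc(sample)[0]:
    print(label, "=", value)
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves element order and allows repeated labels.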


The ERC notion of “story”

The same record as before, in its most compact form:

erc: National Research Council | The Digital Dilemma | 2000 | http://books.nap.edu/html/digital%5Fdilemma

Either ERC form starts by telling the story of an expression of the resource, applying who-what-when-where questions to it.
– All 4 kernel elements are required
– Absent values must be explained; 7 flavors of “empty”
– Element ordering is rigid in compact form (positional semantics)
– Arbitrary additional elements may occur after the 4 elements

Other segments in the ERC may introduce other stories, such as,
– erc-about, erc-support, erc-from
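Positional semantics in the compact form can be illustrated with a short Python sketch. It assumes that values are separated by "|" and never contain that character themselves; the function name `parse_compact_erc` and the `extra` key for trailing elements are hypothetical choices for this example.

```python
KERNEL = ("who", "what", "when", "where")


def parse_compact_erc(line):
    """Split a compact ERC line into its segment label and kernel elements."""
    # The segment label (e.g. "erc") precedes the first colon.
    label, _, rest = line.partition(":")
    values = [v.strip() for v in rest.split("|")]
    if len(values) < len(KERNEL):
        raise ValueError("all 4 kernel elements are required")
    # Position alone decides which kernel element a value belongs to.
    record = dict(zip(KERNEL, values))
    # Anything after the four kernel elements is kept, unnamed.
    record["extra"] = values[len(KERNEL):]
    return label.strip(), record


line = ("erc: National Research Council | The Digital Dilemma | 2000 | "
        "http://books.nap.edu/html/digital%5Fdilemma")
segment, story = parse_compact_erc(line)
print(segment, story["who"], story["when"])
```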


A 2-story ERC record

erc:
who: Tomlinson, Richard
what: Adjustable knock down chair
when: (:unkn)
where: http://espacenet.com/dips/bnsviewer%{
         ? CY=ec & LG=en & DB=EPD & PN=US5498054
         & ID=US+++5498054A1+I+ %}
erc-support:
who: European Patent Office
what: (:permuc) Permanent, Unchanging Content
# Note to ops staff: verify date.
when: 20010621
where: http://ark.espacenet.com/ark:/23003/US5498054


Mapping ERC to Dublin Core

Segment      Kernel Element   Equivalent DC Element
erc          who              Creator/Contributor/Publisher
erc          what             Title
erc          when             Date
erc          where            Identifier
erc-about    who              None
erc-about    what             Subject
erc-about    when             Coverage (temporal)
erc-about    where            Coverage (spatial)


ERC special values

Controlled element values have the form, “(:ccode)”
– e.g., missing: (:unkn) Anonymous, (:unas) Unassigned
– e.g., general: (:791) Bee Stings

Sort-friendly values keyed off of initial comma

who:, van Gogh, Vincent
who:, Howell, III, PhD, 1922-1987, Thurston
who:, Mao Tse Tung
what:, Health and Human Services, United States Government Department of, The,

and their equivalents in natural word order:

Vincent van Gogh
Thurston Howell, III, PhD, 1922-1987
Mao Tse Tung
The United States Government Department of Health and Human Services
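One way to recover natural word order from a sort-friendly value is sketched below. The slide gives only examples, so the rule here is an inference that happens to reproduce all four of them: the piece after the last comma moves to the front, and each trailing comma signals one more inversion point. Both the rule and the name `natural_order` are assumptions of this sketch, not text from the specification.

```python
def natural_order(value):
    """Turn a sort-friendly value (initial comma) into natural word order."""
    if not value.startswith(","):
        return value.strip()          # not sort-friendly; leave as is
    body = value[1:].strip()
    core = body.rstrip(", ")
    # Assumed rule: each trailing comma adds one extra inversion point.
    extra = body[len(core):].count(",")
    # Split at the last (extra + 1) commas and reverse the pieces.
    pieces = core.rsplit(",", extra + 1)
    return " ".join(p.strip() for p in reversed(pieces))


for v in (", van Gogh, Vincent",
          ", Howell, III, PhD, 1922-1987, Thurston",
          ", Mao Tse Tung",
          ", Health and Human Services, United States Government Department of, The,"):
    print(natural_order(v))
```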


ERC dates and expansion blocks

ERC value with an “expansion” block - “%{” and “%}”

where: http://foo.bar.org/node%{?db= foo&start = 1&end = 5&buf = 2&query = foo + bar + zaf
       %}

is equivalent to the correct and intact URL,

where: http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf

Dates are in TEMPER format
1996-2000 (range of four years)
1952, 1957, 1969 (list of three years)
1952, 1958-1967, 1985 (mixed list of dates & ranges)
20001229-20001231 (range of three days)
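The expansion-block equivalence shown above amounts to deleting the “%{” and “%}” markers together with all whitespace between them. A minimal Python sketch follows, assuming blocks do not nest; the function name `collapse_expansion_blocks` is made up for the example.

```python
import re


def collapse_expansion_blocks(value):
    """Remove %{ ... %} markers and all whitespace inside them."""
    def squeeze(match):
        # Drop every whitespace character found inside the block.
        return re.sub(r"\s+", "", match.group(1))
    # DOTALL lets a block span continuation lines.
    return re.sub(r"%\{(.*?)%\}", squeeze, value, flags=re.DOTALL)


raw = ("http://foo.bar.org/node%{?db= foo&start = 1&end = 5"
       "&buf = 2&query = foo + bar + zaf\n  %}")
print(collapse_expansion_blocks(raw))
```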


Kernel/ERC summary

ERC is a cheap, general-purpose citation container
• Its kernel metadata is designed to be a low-barrier way to support orderly management of collections
• Might help resource discovery and description too
• Succinct, trivial to parse, extensible yet predictable in the kernel elements

See http://dublincore.org/groups/kernel/ for more

How to activate an ERC? One way is with THUMP.


Searching and retrieving citations with THUMP

THUMP: The HTTP URL Mapping Protocol
• A set of simple URL-based conventions for retrieving information and conducting searches
• Can be used for focused retrievals or for broad database searches
• Based on commands put in the query string after ‘?’

http://example.foo.com/?in(books)find(war and peace)show(full)
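To make the command syntax concrete, here is a hedged Python sketch that pulls `name(argument)` commands out of the query string of the example URL. It assumes commands are lowercase names followed by a parenthesized argument without nested parentheses; `parse_thump` is a hypothetical helper, not part of any THUMP implementation.

```python
import re
from urllib.parse import unquote

# Assumed shape of one command: lowercase name, then "(argument)".
COMMAND = re.compile(r"([a-z]+)\(([^()]*)\)")


def parse_thump(url):
    """Return the THUMP commands after '?' as (name, argument) pairs."""
    _, _, query = url.partition("?")
    # Undo any %hh escapes before matching.
    return COMMAND.findall(unquote(query))


url = "http://example.foo.com/?in(books)find(war and peace)show(full)"
for name, arg in parse_thump(url):
    print(name, "->", arg)
```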


THUMP requests

The HTTP URL Mapping Protocol (THUMP)
– A protocol based on HTTP and URLs
– A request is passed to a server with HTTP GET (or POST)

Shortest request is a URL ending in ‘?’, as in
    http://example.foo.com/object321?
Which is shorthand for the common request:
    http://example.foo.com/object321?show(brief)as(anvl/erc)

Naked ‘?’ and ‘??’ designed to support the known-item query convention arising in the ARK persistent id scheme
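The shorthand for a naked ‘?’ can be shown as a one-step expansion. This sketch covers only the single-‘?’ case spelled out on the slide; the slide does not say what ‘??’ expands to, so the sketch leaves such URLs untouched. The name `expand_request` is an assumption.

```python
DEFAULT_REQUEST = "show(brief)as(anvl/erc)"


def expand_request(url):
    """Expand a URL ending in a single naked '?' to the full request."""
    if url.endswith("?") and not url.endswith("??"):
        return url + DEFAULT_REQUEST
    # '??' and ordinary URLs are returned unchanged in this sketch.
    return url


print(expand_request("http://example.foo.com/object321?"))
```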


THUMP responses

Responses consist of HTTP response headers, one record set header, and one or more ERC records

C: [opens session]
C: GET http://ark.cdlib.org/ark:/13030/ft167nb0vq? HTTP/1.1
C:
S: HTTP/1.1 200 OK
S: Content-Type: text/plain
S: THUMP-Status: 0.5 200 OK
S:
S: set-start: California Digital Library | THUMP 0.5 | 20060606161407
S:    | http://ark.cdlib.org/ark:/13030/ft167nb0vq?
S:    | http://dublincore.org/groups/kernel/erc
S: here: 1 | 1 | 1
S:
S: erc:
S: who: Stanton A. Glantz and Edith D. Balbach
S: what: Tobacco War: Inside the California Battles
S: when: 20000510
S: where: http://ark.cdlib.org/ark:/13030/ft167nb0vq
S: [closes session]


Broad searching in THUMP

General form of broad query

    Key ? in(DB) find(QUERY) list(RANGE) show(ELEMS) as(FORMAT)

Many details to be worked out; watch for
http://www.ietf.org/internet-drafts/draft-kunze-thump-01.txt


Identifiers and citations

Persistent actionable identifiers should…
• Lead to citation
• Lead to flavors of permanence
• Lead to access (if authorized)
  – Not strictly an “identification” problem, but this is the “404 not found” that we need to fix
• Be valid for some longish period
• Be carried on, in, or with the object


Choosing an id scheme

The most perfect identifier scheme will be useless in the face of service failure
– All identifier persistence is purely about service
– Identifier confidence depends on service access

Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, …
– Which of these builds in service access points?
– Which may actually heighten your risk without lowering your costs?


Impact of identifier scheme choice on causes of identifier breakage

Cause                                                             URN   URL   PURL  Handle  DOI   ARK
Bankruptcy or funding loss, no successor found                    None  None  None  None    None  None
Loss of political support, no successor found                     None  None  None  None    None  None
War, social upheaval, natural disaster                            None  None  None  None    None  None
Provider moved object, didn’t update forwarding/redirect table    None  None  None  None    None  None
Provider removed or replaced the object                           None  None  None  None    None  None


Costs of identifier scheme choice

Cost                                                               URN    URL  PURL   Handle  DOI    ARK
Needs web server, web browser, and DNS, or future equivalents      Yes    Yes  Yes    Yes     Yes    Yes
Requires special intervention beyond the reign of HTTP             Sorta  Yes  Yes    Sorta   Sorta  No
Needs complex, special-purpose global infrastructure or plug-in    Yes    No   Sorta  Yes     Yes    No
Requires maintaining a two-column forwarding table                 Yes    Yes  Yes    Yes     Yes    Yes


Benefits of identifier scheme choice

Benefit                                                              URN    URL  PURL   Handle  DOI/Crossref  ARK/N2T
Global lexical inference of logical sub-objects and object variants  No     No   No     No      No            Yes
API to object persistence policy statement                           No     No   No     No      No            Yes
Protection from hyphenation and namespace splitting                  No     No   No     No      No            Yes/yes
API to object description                                            No     No   No     No      No/yes        Yes
Protection for small institutions from hostname instability          Yes    No   Sorta  Yes     Yes           No/yes
Separation of Name Assigning and Mapping Authorities                 Sorta  No   No     No      No            Yes


ARK identifiers at a glance

An ARK (Archival Resource Key) is a URL with extra structure and conventions. A sample ARK for a digital object:

  http://cdlib.org/ark:/12025/654xz321
  \______________/\__/ \___/ \______/
    (replaceable)   |    |      |
         |     ARK Label |      |
         |               |      3 Name (NAA-assigned)
         |               2 Name Assigning Authority Number (NAAN)
         1 Name Mapping Authority

1 = current service provider; identity inert “booster rocket”
2 = organization that originally assigned the id
3 = name originally assigned to the abstract object, often opaque
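The anatomy above can be expressed as a small parser. The sketch below is an illustration only: the regular expression is a simplification written for this example (it ignores any qualifier after the name), and `parse_ark` is a hypothetical helper rather than a function from any ARK library.

```python
import re

# Optional Name Mapping Authority (scheme + host), then the "ark:" label,
# the numeric NAAN, and the NAA-assigned name.
ARK = re.compile(
    r"^(?:(?P<nma>https?://[^/]+)/)?ark:/(?P<naan>\d+)/(?P<name>[^/?#]+)"
)


def parse_ark(url):
    """Split an ARK into its NMA, NAAN, and name parts."""
    match = ARK.match(url)
    if match is None:
        raise ValueError(f"not an ARK: {url!r}")
    return match.groupdict()


a = parse_ark("http://cdlib.org/ark:/12025/654xz321")
b = parse_ark("http://loc.gov/ark:/12025/654xz321")
# The NMA is replaceable: identity rests on NAAN plus name.
print((a["naan"], a["name"]) == (b["naan"], b["name"]))
```

Comparing only the NAAN and name reflects the slide's point that the mapping authority is an identity-inert part of the identifier.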


ARK usage

Two ARKs accessing the same thing
    http://loc.gov/ark:/12025/654xz321
    http://rutgers.edu/ark:/12025/654xz321

Access to metadata -- add a ‘?’
    http://loc.gov/ark:/12025/654xz321?

Access to support statement -- add ‘??’
    http://loc.gov/ark:/12025/654xz321??

Three minimal requirements to be an ARK
– An archive that can’t do all 3 -- trustworthy?
– Is an ARK persistent? Maybe. Have to ask.


Transparency vs opaqueness

• Do ARKs have to be this ugly (opaque)?

  http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff
  \___________________/ \__/ \___/ \______/ \____________/
          NMAH          Label NAAN   Name     Qualifier

• No, but they encourage it. Persistence is all about managing associations between strings and things
  – And the landscape is littered with links that were required to die for political, legal, or social reasons
  – the appearance, deliberate or even accidental, of once-true assertions that are now misleading, infringing, offensive makes it hard for our descendants to continue managing

• Pain of managing opaque ids mitigated by strong metadata
  – But opacity makes for ids that age and travel well
  – Hybrid: opaque ids name abstract preservation objects, and semantic/transient extensions address sub-objects


ARK lexical goodies

Hyphens ignored
– Neutralizes harm done by typesetters

Too many search results? Providers may disclose (or not)…
– Sub-object hierarchy using reserved ‘/’
– Variant objects using reserved ‘.’
– Usual %hh (hex encoding) as an escape
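“Hyphens ignored” suggests a simple normalization step before comparing or resolving ARKs. The sketch below strips hyphens only from the part after the `ark:/` label, on the assumption that hostnames may legitimately contain hyphens; `normalize_ark` is a name invented for this example.

```python
def normalize_ark(ark):
    """Remove hyphens from the ARK proper, leaving any hostname alone."""
    head, label, tail = ark.partition("ark:/")
    # If there is no "ark:/" label, label and tail are empty and the
    # input comes back unchanged.
    return head + label + tail.replace("-", "")


# A typesetter broke the identifier across a line and added a hyphen.
print(normalize_ark("http://my-host.org/ark:/12025/654-xz321"))
```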


Revealing hierarchy in ARKs

Sub-object hierarchy using reserved x/y
• Meaning x is containing object for y, for some defined containment relationship
• Works for chapters, sections, pages, etc
• Persistence of that relationship?

Saying ark:/12025/654/xz/321 implies also
– ark:/12025/654/xz
– ark:/12025/654


Revealing variants in ARKs

Variant objects using reserved x.y
• Meaning x is an object basename with variant y, for some defined variant relationship
• Works for formats, versions, languages, etc
• Persistence of that relationship?

Saying ark:/12025/654.20v.78g implies also
– ark:/12025/654.20v
– ark:/12025/654
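The lexical inference for the reserved ‘/’ and ‘.’ characters can be sketched as repeatedly trimming the last component from the name. This is a minimal illustration that treats both separators the same way, which is an assumption of the sketch; `implied_arks` is a hypothetical name.

```python
def implied_arks(ark):
    """List the ARKs implied by trimming trailing '/' and '.' components."""
    prefix, label, rest = ark.partition("ark:/")
    naan, _, name = rest.partition("/")
    implied = []
    while True:
        # Find the last reserved separator inside the name.
        cut = max(name.rfind("/"), name.rfind("."))
        if cut <= 0:
            break
        name = name[:cut]
        implied.append(f"{prefix}{label}{naan}/{name}")
    return implied


print(implied_arks("ark:/12025/654/xz/321"))
print(implied_arks("ark:/12025/654.20v.78g"))
```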


ARK namespaces reserved

12025 National Library of Medicine
12026 Library of Congress
12027 National Agriculture Library
13030 California Digital Library
13960 Internet Archive
13038 World Intellectual Property Organization
20775 University of California San Diego
29114 University of California San Francisco
28722 University of California Berkeley
15230 Rutgers University Libraries
64269 UK Digital Curation Centre
62624 New York University Libraries
67531 University of North Texas Libraries
27927 Portico/Ithaka Electronic-Archiving Initiative
12148 National Library of France
78319 Google Book Search

Reserve a namespace by email to [email protected]


Opaque identifier tools

Non-opaque identifier strings are chosen deliberately to assert some things that are true at the time of assignment

Opaque identifier strings are best chosen by automated means, such as
• NOID (nice opaque identifier)
• Or UUID/GUID (universally unique identifier)
  – Sequence of hex encodings of your computer’s MAC address, current time, and sometimes a random number
  – No need to ask permission or register yourself, but based on IEEE and hardware vendor registries


Nice opaque identifiers (NOID)

• A noid minter is a lightweight database for generating, tracking, and binding unique ids

• The noid tool creates minters and accepts commands that operate them
  – Open source, available at www.cpan.org

• Can mint in random or sequential order, with or without a check character guaranteeing against the most common transcription errors

• Anyone can run a noid minter, maintain associations via bindings to arbitrary elements (assertions), and set up a resolver (including rule-based)


Using NOID

• Identifiers minted according to a template:
    noid dbcreate f5.reedeedk long 13030
  which produces as first minted id
    13030/f54x54g11

• Noid is scheme-independent
  – Can be used to mint DOIs, URNs, URLs, lotto numbers, etc.
  – We (at CDL) use it to mint sequential transaction ids and randomized ARKs with check chars
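The check character mentioned above can be sketched in Python. The algorithm below is the one described in the NOID documentation as best recalled here (position-weighted sum over a 29-character alphabet, modulo 29), so treat it as an assumption rather than the tool's source; it does reproduce the final “1” of the slide's example id. The names `XDIGITS`, `check_char`, and `is_valid` are chosen for this example.

```python
# Digits plus consonants, 29 characters in all (assumed NOID alphabet).
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"


def check_char(base):
    """Compute a check character for an id string that lacks one."""
    total = 0
    for position, ch in enumerate(base, start=1):
        # Characters outside the alphabet (such as '/') count as zero.
        total += position * max(XDIGITS.find(ch), 0)
    return XDIGITS[total % len(XDIGITS)]


def is_valid(identifier):
    """True if the last character matches the computed check character."""
    return check_char(identifier[:-1]) == identifier[-1]


print(check_char("13030/f54x54g1"))   # the slide's id minus its last char
print(is_valid("13030/f54x54g11"))
```

Weighting each character by its position is what lets a scheme like this catch transposed neighbors as well as single-character substitutions.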


Documentation

• ARK specification http://www.ietf.org/internet-drafts/draft-kunze-ark-11.txt

• NOID http://www.cdlib.org/inside/diglib/ark/noid.pdf


Persistent citation next steps

Thinking through how to express when you’re citing a snapshot vs a stage that enacts a thematically consistent thing

Persistent citation for small organizations
– Their objects are served from at-risk hostnames and servers
– They need a Name-to-Thing resolver


N2T “Entity” – persistent identifiers for smaller organizations

Establish a consortium and a small web server.

Each member publishes URLs under n2t.info:

http://n2t.info/12345/foo/bar.zaf

…which redirects to the member server URL.

Why? It solves the same persistent identifier problem as URN, DOI, and Handle systems, but more fully, and at lower cost and risk.


Persistent identification

Persistent identifiers? We have them.
• But still need persistent actionable ids

Actionable “with widely available tools”
• Which really means “with URLs”

URNs, ARKs, Handles, DOIs, etc. become actionable (practically speaking) when embedded in URLs
• All these ids have similar maintenance costs, and they all break for the usual reasons


The usual reasons

Whatever the string, what matters is the thing
• If the thing’s unavailable, the id’s broken

Broken, for URLs, means either
• The hostname is broken
  – Server down, gone, or renamed *
  – Domain name lost, provider out of business
• Or the pathname to the thing is broken
  – Thing down, gone, or renamed *

* No global fix for these, only the provider can fix.


Hostname instability

Domain name lost, provider out of business
• We can help this case

Smaller organizations most vulnerable
• The comfort of not seeing your hostname
• The comfort of seeing your hostname
• Traditional solutions: PURL, URN, Handle, DOI
• Solutions tied to special-purpose technology, sometimes complex and proprietary


N2T (Name-to-Thing)

N2T is two things at once
• A consortium of cultural memory organizations … and …
• A small, ordinary web server, mirrored in several instances globally for reliability

Basic idea: protect 200 organizations’ URLs from hostname instability with 200 rewrite rules

How: simple HTTP redirects, one per organization
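The “one rewrite rule per organization” idea can be sketched as a lookup table keyed by the organization's number. Everything concrete here is hypothetical: the `RULES` table, the member hostname, and the function name `redirect_target` are made up for illustration, and a real deployment would more likely use web server rewrite rules than Python.

```python
# Hypothetical rule table: organization number -> current member server.
RULES = {
    "12345": "http://www.example.org/",
}


def redirect_target(path, rules=RULES):
    """Map a resolver path like /12345/foo/bar.zaf to the member's URL."""
    number, _, rest = path.lstrip("/").partition("/")
    base = rules.get(number)
    if base is None:
        return None               # unknown organization: nothing to redirect to
    return base.rstrip("/") + "/" + rest


print(redirect_target("/12345/foo/bar.zaf"))
```

When a member's hostname changes, only its single entry in the table changes; every published resolver URL keeps working.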


N2T – user point of view

Each consortium member organization gets a unique number, such as 12345.


N2T – system point of view

Technically, resolution (access to a thing given its name) is two simple steps.


N2T – consortium point of view

“Consortium-lite”
– Members have no fees or responsibilities
– One domain name for whole consortium

Rent is $30/year, runs on 4 total web servers

Volunteer member orgs run the servers
– 1 primary + 3 mirrors

Interested bodies: CENDI, DLF, DCC
Interested institutions: CDL, NYU, NLA, …


N2T – global point of view

Regional (e.g., Europe, Asia, North America) clusters of mirrored resolver instances, with round-robin failover for redundancy, fault-tolerance, and load-sharing

• No browser modifications required

[Figure: user web browser connected to four resolver instances]


Namespace Splitting Problem

[Figure: user web browser, URN/Handle/DOI resolver, and organizations A, B, and C, with URL, URL’, and URL’’ pointing at numbered pages 1-15]

Org’n A’s namespace splits when B and C inherit its objects. Under the URN/Handle/DOI model, B must still forward to C.


Solution: add per-object rules

Over years: add per-object redirects as objects go their separate ways
• No server other than N2T has to keep forwarding tables

This requires periodic (e.g., daily) harvesting of local table mappings from well-known provider-side server files, e.g., Google sitemaps

[Figure: user web browser, global resolver, and final web server (to/from), with harvested updates flowing to the resolver from a university library, a national library, a national archive, . . .]
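Per-object rules layered over per-organization rules can be sketched as a two-level lookup, with the more specific rule winning. The tables, hostnames, and the name `resolve_with_overrides` below are all hypothetical; the harvesting step that would populate `object_rules` is not shown.

```python
def resolve_with_overrides(path, object_rules, org_rules):
    """Resolve a path, preferring a per-object rule over the org rule."""
    key = path.lstrip("/")
    # 1. An exact per-object redirect wins if one has been harvested.
    if key in object_rules:
        return object_rules[key]
    # 2. Otherwise fall back to the organization's single rewrite rule.
    number, _, rest = key.partition("/")
    base = org_rules.get(number)
    if base is None:
        return None
    return base.rstrip("/") + "/" + rest


# Hypothetical split: most of 12345's pages now live at B, page 7 at C.
org_rules = {"12345": "http://b.example.net/"}
object_rules = {"12345/page/7": "http://c.example.net/page/7"}
print(resolve_with_overrides("/12345/page/7", object_rules, org_rules))
print(resolve_with_overrides("/12345/page/2", object_rules, org_rules))
```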


Prototype resolver

Sample identifiers at n2t.info – these work now
    http://n2t.info/12345/libraries/visitor.html
    http://n2t.info/13030/inside
    http://n2t.info/urn:nbn:se:uu:diva-3324
    http://n2t.info/ark:/13030/tf5p30086k

Incidentally, it can also redirect all URNs, DOIs, and Handles, e.g.,
    http://n2t.info/doi:10.1111/j.0307-6946.2004.00571.x


Advantages of the n2t.info persistent identifier resolver

• Big reduction in architectural complexity
• No browser modification required
• Identifier scheme-agnostic
• No proprietary, special-purpose infrastructure to carry forward as a liability to persistence


Identifier myths

What’s an identifier as opposed to a mere string of data? Are these identifiers?

1L039G81UPeter

http://www.cdlib.edu/staff/~greenstein


Identifiers and locations

We want identifiers that are location-independent. Which of these are location-independent?

http://dot.ucop.edu/home/jak
http://n2t.info/14998/xt8732r
http://n2t.info/14998/xt8732r?x=foo&y=bar


Dynamic and static

People complain about dynamically generated pages. Which of these are dynamically generated?

http://dot.ucop.edu/home/jak
http://n2t.info/14998/xt8732r
http://n2t.info/14998/xt8732r?x=foo&y=bar


Identifier basics

An identifier is an association between a string and a thing; you cannot tell by looking at it
– Strongest association: object wearing its identifier

In the real world, an identifier is also an opinion, and opinions will differ

Identifier persistence is purely a matter of service, not magically conferred by a scheme
– Identifier form never imparts anything meaningful for persistence concerning location or static-vs-dynamic

Identifiers break due to politics, poorly set user expectations, and low organization awareness


Identifier/citation next steps

Citing a changing object
• NLM permanence ratings
• Change history (back and forth)
• Cheap fixity

Congruence between databases and documents
• Hierarchy, complexity

Annotations are objects too
• Main distinguishing feature being the target object

Units of citation, delivery, addressing, recall