Supporting Persistent Citation
Webcast 4pm 11 December 2006
John Kunze, California Digital Library
California Digital Library
• A university library with no books, students, or faculty
• Central services for 10 campuses
  – 208,000 students
  – 121,000 faculty and staff
  – 100+ libraries and museums
• Content hosting: electronic texts, web-based material, finding aids, scanned book content (OCA, Google), datasets
What’s digital preservation?
Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions
Truthiness defined by the Designated Community
Kinds of loss
• Hard loss - some or all data bits are missing
• Soft loss - we think we have all the bits …
• Syntactic loss - bits are there, but format cannot be rendered by software
• Semantic loss - data renderable without apparent error, but not understandably
• Legal loss - data format or data itself is legally encumbered
Digital preservation in two parts
• Object: safeguarding its…
  – Viability (intact bit streams)
  – Renderability (by machines)
  – Understandability (by humans)
• Citation: no preservation if we don’t know…
  – What the object is, i.e., a summary description
  – What flavor of support (assuming non-accidental preservation) is intended by the provider
  – How to get it, or its persistent actionable identifier
Our Stuff vs Their Stuff
• While safeguarding objects is primary, focus now on tools for persistent reference: citations and identifiers
• Persistent reference can be split into
  – the Our Stuff Problem
  – the Their Stuff Problem
• No sense assigning persistent ids to Their Stuff
  – While Their Stuff is hugely important to Our users, we don’t control it and we cannot vouch for it
  – Where affordable, we might track Their Stuff (eg, PURLs)
• New focus: persistent reference to Our Stuff
Citations as object surrogates
Surrogates provide a time-honored way of avoiding the inconvenience of directly handling objects.
  – Surrogates are usually much smaller (eg, a catalog card)
  – Unlike the objects, surrogates may be unencumbered and in a language that you understand
  – Surrogates are much more uniform (for easier processing) than objects
  – Every system has surrogates, even if dynamically generated
  – A surrogate is essentially an object citation
Reminder: What is a citation for?
  – A citation is a surrogate-based tool to help us find, use, and manage information objects, resources, or stuff.
Citation metadata
Metadata definition 1: “data about data”
  – Too broad and too narrow, e.g., a book review? a catalog record for a statue?
Metadata definition 2: “structured data about stuff”
  – “stuff” avoids having to say a statue is data
  – “structured” data assists automation by making it easy to recognize and record individual data elements
  – The more uniform, the more leverage for interoperation
  – Automation + Interoperation ⇒ Protocol
Citation metadata is structured data that is usually, but not always, secondary to, smaller than, and about stuff that is primary
Why citations?
The identifier isn’t enough
• Higher confidence through limited redundancy
The object is too much
• Often inconvenient to handle objects directly
• Need easy handling of diverse objects through uniform surrogates
Activating citations with protocols: simplicity / functionality pendulum
In the beginning, … application protocols were layered on TCP/IP
  – Simplicity: email set the standard (RFC 822 headers)
  – HTTP, NNTP, gopher, etc. followed its lead; OSI protocols withered
Then: second system syndrome (expanding functionality):
  – Z39.50, CORBA, SOAP, and others
Regret period (contracting complexity):
  – OpenSearch, RSS, and in DL world, SRW/SRU, OAI
How are we doing?
  – Tues 13 June 2006: “low barrier” OAI failures attributed to errors in XML coding, schemas; poor, inconsistent, and expensive metadata; with surrogates too non-uniform to be of much use [CL & CL]
Perhaps things are still too complicated?
Simple citation metadata isn’t
Dublin Core metadata tried to be simple
  – Goal: “specification shouldn’t register on a bathroom scale”
Goal achieved, but spec. was under-specified
  – No definition of record
  – No concept of minimal object description
  – No layout rules for author names and dates
  – No practical extension framework
  – No meta-metadata, eg, provenance, commitment statements
15 Dublin Core elements
Thought to apply to almost any object – discovery as goal
Content: Title, Subject, Source, Relation, Type, Description, Coverage
Intellectual Property: Rights, Publisher, Creator, Contributor
Instantiation: Language, Identifier, Format, Date
Despite DCMI efforts to correct known problems, the simplest protocol with the simplest metadata – OAI – reports an overall 36% failure rate, 77% due to metadata/encoding and protocol errors.
“Simple” Dublin Core metadata

    <?xml version="1.0"?>
    <!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN"
        "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
        <dc:title>The Digital Dilemma</dc:title>
        <dc:creator>National Research Council</dc:creator>
        <dc:date>2000-06-22</dc:date>
      </rdf:Description>
    </rdf:RDF>

Same record with Dublin “Kernel”
Here’s the same information, still machine-readable, as an Electronic Resource Citation (ERC) with Kernel metadata:

    erc:
    who:   National Research Council
    what:  The Digital Dilemma
    when:  2000
    where: http://books.nap.edu/html/digital%5Fdilemma

Motivators for the ERC
  – Meet the need for a simple and manipulable record
  – Direct human contact with metadata is inevitable
  – Record should place minimal strain on people
  – Succinct, transparent, trivially parseable (2 lines of Perl code)
Making it minimal: Kernel/ERC
Electronic Resource Citation (ERC) - back to basics
• An ERC record is a sequence of elements in email header format:
  ⇒ label, colon, value
• Long values are continued on indented lines
• A blank line ends a record
Based on cross-domain kernel distilled from Dublin Core
• who - a responsible person or party
• what - a name or other human-oriented identifier
• when - a date important in the object’s lifecycle
• where - a location or a machine-oriented identifier
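
The three record rules above (label, colon, value; indented continuation lines; blank line ends a record) can be sketched in a few lines of Python. This is an illustrative reading of the slide, not the reference parser: the function name parse_erc is hypothetical, and comment lines and special values are deliberately ignored.

```python
def parse_erc(text):
    """Parse ERC-style text into a list of records.

    Each record is a list of (label, value) pairs, following only the
    three rules on the slide: 'label: value' lines, indented continuation
    lines, and a blank line ending a record.
    """
    records, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = []
        elif line[0] in " \t" and current:
            # An indented line continues the previous element's value.
            label, value = current[-1]
            current[-1] = (label, (value + " " + line.strip()).strip())
        else:
            # Split on the first colon only, so URLs in values survive.
            label, _, value = line.partition(":")
            current.append((label.strip(), value.strip()))
    if current:
        records.append(current)
    return records
```

Keeping elements as ordered pairs rather than a dictionary preserves element order and allows a label to repeat within one record.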
The ERC notion of “story”
The same record as before, in its most compact form:

    erc: National Research Council | The Digital Dilemma | 2000 | http://books.nap.edu/html/digital%5Fdilemma

Either ERC form starts by telling the story of an expression of the resource, applying who-what-when-where questions to it.
  – All 4 kernel elements are required
  – Absent values must be explained; 7 flavors of “empty”
  – Element ordering is rigid in compact form (positional semantics)
  – Arbitrary additional elements may occur after the 4 elements
Other segments in the ERC may introduce other stories, such as,
  – erc-about, erc-support, erc-from
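
Because the compact form relies on positional semantics, expanding it back into labeled elements is mostly a matter of splitting on the vertical bar. The sketch below assumes values never contain a literal "|" and that extra positional values simply follow the first four; the name expand_compact_erc is hypothetical.

```python
KERNEL = ("who", "what", "when", "where")

def expand_compact_erc(line):
    """Expand a compact 'erc: a | b | c | d' line into labeled kernel elements.

    Returns (story, extras): a dict of the four kernel elements and a list
    of any further positional values found after them.
    """
    label, sep, rest = line.partition(":")
    if not sep or label.strip() != "erc":
        raise ValueError("not a compact ERC line")
    values = [v.strip() for v in rest.split("|")]
    if len(values) < len(KERNEL):
        raise ValueError("all 4 kernel elements are required")
    # zip stops after the four kernel labels; the rest are extras.
    story = dict(zip(KERNEL, values))
    return story, values[len(KERNEL):]
```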
A 2-story ERC record

    erc:
    who:   Tomlinson, Richard
    what:  Adjustable knock down chair
    when:  (:unkn)
    where: http://espacenet.com/dips/bnsviewer%{
              ? CY=ec & LG=en & DB=EPD & PN=US5498054
              & ID=US+++5498054A1+I+ %}
    erc-support:
    who:   European Patent Office
    what:  (:permuc) Permanent, Unchanging Content
    # Note to ops staff: verify date.
    when:  20010621
    where: http://ark.espacenet.com/ark:/23003/US5498054

Mapping ERC to Dublin Core
Kernel Element | Equivalent DC Element
erc
  who | Creator/Contributor/Publisher
  what | Title
  when | Date
  where | Identifier
erc-about
  who | None
  what | Subject
  when | Coverage (temporal)
  where | Coverage (spatial)
ERC special values
Controlled element values have the form, “(:ccode)”
  – e.g., missing: (:unkn) Anonymous, (:unas) Unassigned
  – e.g., general: (:791) Bee Stings
Sort-friendly values keyed off of initial comma

    who:, van Gogh, Vincent
    who:, Howell, III, PhD, 1922-1987, Thurston
    who:, Mao Tse Tung
    what:, Health and Human Services, United States Government Department of, The,

and their equivalents in natural word order:

    Vincent van Gogh
    Thurston Howell, III, PhD, 1922-1987
    Mao Tse Tung
    The United States Government Department of Health and Human Services
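
One rule that reproduces all four examples above: without a trailing comma, only the last comma is an inversion point (the final piece moves to the front); with a trailing comma, every comma is an inversion point (the pieces are reversed). The sketch below is inferred from these examples alone and is an assumption, so the ERC specification should be consulted for the normative rule; natural_order is a hypothetical name.

```python
def natural_order(value):
    """Recover natural word order from a sort-friendly ERC value.

    Assumed rule, inferred from the slide's examples only:
    - no initial comma: the value is already in natural order
    - trailing comma: every comma is an inversion point (reverse all pieces)
    - otherwise: only the last comma inverts (move the final piece to front)
    """
    v = value.strip()
    if not v.startswith(","):
        return v
    v = v[1:].strip()
    if v.endswith(","):
        pieces = [p.strip() for p in v[:-1].split(",")]
        return " ".join(reversed(pieces))
    head, sep, last = v.rpartition(",")
    if not sep:
        # Initial comma but nothing to invert, e.g. ", Mao Tse Tung".
        return v
    return last.strip() + " " + head.strip()
```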
ERC dates and expansion blocks
ERC value with an “expansion” block - “%{” and “%}”

    where: http://foo.bar.org/node%{
              ? db = foo
              & start = 1
              & end = 5
              & buf = 2
              & query = foo + bar + zaf
           %}

is equivalent to the correct and intact URL,

    where: http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
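
The equivalence above amounts to deleting the block delimiters and all whitespace between them. A minimal sketch, assuming blocks do not nest and that every whitespace character inside a block is insignificant (collapse_expansion_blocks is a hypothetical name):

```python
import re

# Matches one expansion block, including any line breaks inside it.
_BLOCK = re.compile(r"%\{(.*?)%\}", re.DOTALL)

def collapse_expansion_blocks(value):
    """Remove '%{' and '%}' and all whitespace found between them."""
    return _BLOCK.sub(lambda m: re.sub(r"\s+", "", m.group(1)), value)
```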
Dates are in TEMPER format

    1996-2000 (range of four years)
    1952, 1957, 1969 (list of three years)
    1952, 1958-1967, 1985 (mixed list of dates & ranges)
    20001229-20001231 (range of three days)
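
The list and range forms shown above can be read with two splits: commas separate items, and a hyphen separates the ends of a range. This sketch covers only the forms on this slide and does no validation; the full TEMPER format may allow more, and parse_temper is a hypothetical name.

```python
def parse_temper(value):
    """Split a TEMPER-style value into a list of (start, end) string pairs.

    A single date yields a pair whose start and end are the same.
    Only comma-separated lists and hyphenated ranges are handled.
    """
    spans = []
    for item in value.split(","):
        start, sep, end = item.strip().partition("-")
        start = start.strip()
        spans.append((start, end.strip() if sep else start))
    return spans
```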
Kernel/ERC summary
ERC is a cheap, general-purpose citation container
• Its kernel metadata is designed to be a low-barrier way to support orderly management of collections
• Might help resource discovery and description too
• Succinct, trivial to parse, extensible yet predictable in the kernel elements
See http://dublincore.org/groups/kernel/ for more
How to activate an ERC? One way is with THUMP.
Searching and retrieving citations with THUMP
THUMP: The HTTP URL Mapping Protocol
• A set of simple URL-based conventions for retrieving information and conducting searches
• Can be used for focused retrievals or for broad database searches
• Based on commands put in the query string after ‘?’

    http://example.foo.com/?in(books)find(war and peace)show(full)

THUMP requests
The HTTP URL Mapping Protocol (THUMP)
  – A protocol based on HTTP and URLs
  – A request is passed to a server with HTTP GET (or POST)
Shortest request is a URL ending in ‘?’, as in
    http://example.foo.com/object321?
Which is shorthand for the common request:
    http://example.foo.com/object321?show(brief)as(anvl/erc)
Naked ‘?’ and ‘??’ designed to support the known-item query convention arising in the ARK persistent id scheme
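
A server-side reading of these requests might look like the sketch below: split the URL at the first ‘?’, expand the naked ‘?’ shorthand, and collect the name(argument) commands. The function name parse_thump and the regular expression are assumptions for illustration; the ‘??’ form is not given an expansion on the slide, so the sketch simply yields no commands for it, and real requests would also need percent-decoding.

```python
import re

# Expansion of a naked '?', as stated on the slide.
BRIEF_SHORTHAND = "show(brief)as(anvl/erc)"

def parse_thump(url):
    """Split a THUMP-style URL into (target, [(command, argument), ...])."""
    target, sep, query = url.partition("?")
    if sep and query == "":
        # URL ends in a single '?': use the common brief request.
        query = BRIEF_SHORTHAND
    commands = re.findall(r"(\w+)\(([^)]*)\)", query)
    return target, commands
```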
THUMP responses
Responses consist of HTTP response headers, one record set header, and one or more ERC records

    C: [opens session]
    C: GET http://ark.cdlib.org/ark:/13030/ft167nb0vq? HTTP/1.1
    C:
    S: HTTP/1.1 200 OK
    S: Content-Type: text/plain
    S: THUMP-Status: 0.5 200 OK
    S:
    S: set-start: California Digital Library | THUMP 0.5 | 20060606161407
    S:    | http://ark.cdlib.org/ark:/13030/ft167nb0vq?
    S:    | http://dublincore.org/groups/kernel/erc
    S: here: 1 | 1 | 1
    S:
    S: erc:
    S: who: Stanton A. Glantz and Edith D. Balbach
    S: what: Tobacco War: Inside the California Battles
    S: when: 20000510
    S: where: http://ark.cdlib.org/ark:/13030/ft167nb0vq
    S: [closes session]

Broad searching in THUMP
General form of broad query
    Key ? in(DB) find(QUERY) list(RANGE) show(ELEMS) as(FORMAT)
Many details to be worked out; watch for
    http://www.ietf.org/internet-drafts/draft-kunze-thump-01.txt
Identifiers and citations
Persistent actionable identifiers should…
• Lead to citation
• Lead to flavors of permanence
• Lead to access (if authorized)
  – Not strictly an “identification” problem, but this is the “404 not found” that we need to fix
• Be valid for some longish period
• Be carried on, in, or with the object
Choosing an id scheme
The most perfect identifier scheme will be useless in the face of service failure
  – All identifier persistence is purely about service
  – Identifier confidence depends on service access
Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, …
  – Which of these builds in service access points?
  – Which may actually heighten your risk without lowering your costs?
Impact of identifier scheme choice on causes of identifier breakage
Cause | URN | URL | PURL | Handle | DOI | ARK
Provider removed or replaced the object | None | None | None | None | None | None
Provider moved object, didn’t update forwarding/redirect table | None | None | None | None | None | None
War, social upheaval, natural disaster | None | None | None | None | None | None
Loss of political support, no successor found | None | None | None | None | None | None
Bankruptcy or funding loss, no successor found | None | None | None | None | None | None
Costs of identifier scheme choice
Cost | URN | URL | PURL | Handle | DOI | ARK
Requires maintaining a two-column forwarding table | Yes | Yes | Yes | Yes | Yes | Yes
Needs complex, special-purpose global infrastructure or plug-in | Yes | No | Sorta | Yes | Yes | No
Requires special intervention beyond the reign of HTTP | Sorta | Yes | Yes | Sorta | Sorta | No
Needs web server, web browser, and DNS, or future equivalents | Yes | Yes | Yes | Yes | Yes | Yes
Benefits of identifier scheme choice
Benefit | URN | URL | PURL | Handle | DOI/Crossref | ARK/N2T
Separation of Name Assigning and Mapping Authorities | Sorta | No | No | No | No | Yes
Protection for small institutions from hostname instability | Yes | No | Sorta | Yes | Yes | No/yes
API to object description | No | No | No | No | No/yes | Yes
Protection from hyphenation and namespace splitting | No | No | No | No | No | Yes/yes
API to object persistence policy statement | No | No | No | No | No | Yes
Global lexical inference of logical sub-objects and object variants | No | No | No | No | No | Yes
ARK identifiers at a glance
An ARK (Archival Resource Key) is a URL with extra structure and conventions. A sample ARK for a digital object:

    http://cdlib.org/ark:/12025/654xz321
    \______________/ \__/ \___/ \______/
     (replaceable)    |     |      |
           |      ARK Label |      |
           |                |      3 Name (NAA-assigned)
           |                2 Name Assigning Authority Number (NAAN)
           1 Name Mapping Authority

1 = current service provider; identity inert “booster rocket”
2 = organization that originally assigned the id
3 = name originally assigned to the abstract object, often opaque
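
The anatomy above maps directly onto a small splitting routine. The sketch assumes the simple form shown here (hostname, the "ark:/" label, NAAN, then name) and uses hypothetical names for the function and dictionary keys.

```python
from urllib.parse import urlsplit

def split_ark(url):
    """Split an ARK URL into its Name Mapping Authority, NAAN, and Name."""
    parts = urlsplit(url)
    start = parts.path.find("ark:/")
    if start < 0:
        raise ValueError("no ARK label found")
    # Everything after the label is 'NAAN/Name'.
    naan, _, name = parts.path[start + len("ark:/"):].partition("/")
    return {"nma": parts.netloc, "naan": naan, "name": name}
```

Only the "naan" and "name" parts identify the object; the "nma" part is the replaceable service provider.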
ARK usage
Two ARKs accessing the same thing
    http://loc.gov/ark:/12025/654xz321
    http://rutgers.edu/ark:/12025/654xz321
Access to metadata -- add a ‘?’
    http://loc.gov/ark:/12025/654xz321?
Access to support statement -- add ‘??’
    http://loc.gov/ark:/12025/654xz321??
Three minimal requirements to be an ARK
  – An archive that can’t do all 3 -- trustworthy?
  – Is an ARK persistent? Maybe. Have to ask.
Transparency vs opaqueness
• Do ARKs have to be this ugly (opaque)?

    http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff
    \___________________/ \__/ \___/ \______/ \____________/
            NMAH         Label NAAN    Name     Qualifier

• No, but they encourage it. Persistence is all about managing associations between strings and things
  – And the landscape is littered with links that were required to die for political, legal, or social reasons
  – the appearance, deliberate or even accidental, of once-true assertions that are now misleading, infringing, or offensive makes it hard for our descendants to continue managing
• Pain of managing opaque ids mitigated by strong metadata
  – But opacity makes for ids that age and travel well
  – Hybrid: opaque ids name abstract preservation objects, and semantic/transient extensions address sub-objects
ARK lexical goodies
Hyphens ignored
  – Neutralizes harm done by typesetters
Too many search results? Providers may disclose (or not)…
  – Sub-object hierarchy using reserved ‘/’
  – Variant objects using reserved ‘.’
  – Usual %hh (hex encoding) as an escape
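
Since hyphens are ignored and the hostname is replaceable, two ARK strings can be compared by reducing each to a core form. The sketch below handles only those two points plus a trailing ‘?’ or ‘??’; it leaves %hh escapes alone, and ark_core is a hypothetical name.

```python
def ark_core(ark):
    """Reduce an ARK (with or without a hostname) to a comparable core string.

    Drops everything before the 'ark:/' label, strips trailing '?' request
    markers, and removes hyphens from the identifier itself.
    """
    start = ark.find("ark:/")
    if start < 0:
        raise ValueError("no ARK label found")
    return ark[start:].rstrip("?").replace("-", "")
```

Slicing at the label first means a hyphen in the hostname never affects the comparison.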
Revealing hierarchy in ARKs
Sub-object hierarchy using reserved x/y
• Meaning x is containing object for y, for some defined containment relationship
• Works for chapters, sections, pages, etc
• Persistence of that relationship?
Saying ark:/12025/654/xz/321 implies also
  – ark:/12025/654/xz
  – ark:/12025/654
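
The implied containing objects can be derived purely lexically, by dropping trailing path components of the name. A small sketch, assuming the input starts with the "ark:/" label followed by the NAAN (implied_containers is a hypothetical name):

```python
def implied_containers(ark):
    """List the containing objects implied by '/' in an ARK name, nearest first."""
    _, _, rest = ark.partition("ark:/")
    naan, _, name = rest.partition("/")
    parts = name.split("/")
    # Drop one trailing component at a time, never dropping the first.
    return [
        "ark:/" + naan + "/" + "/".join(parts[:i])
        for i in range(len(parts) - 1, 0, -1)
    ]
```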
Revealing variants in ARKs
Variant objects using reserved x.y
• Meaning x is an object basename with variant y, for some defined variant relationship
• Works for formats, versions, languages, etc
• Persistence of that relationship?
Saying ark:/12025/654.20v.78g implies also
  – ark:/12025/654.20v
  – ark:/12025/654
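
Variants can be inferred the same way, by dropping trailing ‘.’ suffixes. The sketch assumes the variant suffixes sit on the last path component only; implied_basenames is a hypothetical name.

```python
def implied_basenames(ark):
    """List the objects implied by '.' variant suffixes, nearest first."""
    head, _, last = ark.rpartition("/")
    parts = last.split(".")
    # Drop one trailing variant suffix at a time, keeping the basename.
    return [
        head + "/" + ".".join(parts[:i])
        for i in range(len(parts) - 1, 0, -1)
    ]
```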
ARK namespaces reserved
    12025 National Library of Medicine
    12026 Library of Congress
    12027 National Agriculture Library
    13030 California Digital Library
    13960 Internet Archive
    13038 World Intellectual Property Organization
    20775 University of California San Diego
    29114 University of California San Francisco
    28722 University of California Berkeley
    15230 Rutgers University Libraries
    64269 UK Digital Curation Centre
    62624 New York University Libraries
    67531 University of North Texas Libraries
    27927 Portico/Ithaka Electronic-Archiving Initiative
    12148 National Library of France
    78319 Google Book Search
Reserve a namespace by email to [email protected]
Opaque identifier tools
Non-opaque identifier strings are chosen deliberately to assert some things that are true at the time of assignment
Opaque identifier strings are best chosen by automated means, such as
• NOID (nice opaque identifier)
• Or UUID/GUID (universally unique identifier)
  – Sequence of hex encodings of your computer’s MAC address, current time, and sometimes a random number
  – No need to ask permission or register yourself, but based on IEEE and hardware vendor registries
Nice opaque identifiers (NOID)
• A noid minter is a lightweight database for generating, tracking, and binding unique ids
• The noid tool creates minters and accepts commands that operate them
  – Open source, available at www.cpan.org
• Can mint in random or sequential order, with or without a check character guaranteeing against the most common transcription errors
• Anyone can run a noid minter, maintain associations via bindings to arbitrary elements (assertions), and set up a resolver (including rule-based)
Using NOID
• Identifiers minted according to a template:
    noid dbcreate f5.reedeedk long 13030
  which produces as first minted id
    13030/f54x54g11
• Noid is scheme-independent
  – Can be used to mint DOIs, URNs, URLs, lotto numbers, etc.
  – We (at CDL) use it to mint sequential transaction ids and randomized ARKs with check chars
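
To make the template concrete: as the NOID documentation describes it, "f5" is a literal prefix, "r" asks for random minting order, each "e" is an extended digit (a digit or one of a reduced set of consonants), each "d" is a digit, and the final "k" is a check character. The sketch below recomputes the check character using the position-weighted sum described there. It is a reading of that documentation, not the noid tool's code, so results should be verified against the tool; check_char and matches_mask are hypothetical names.

```python
# Extended-digit alphabet: digits plus consonants, 29 symbols in all.
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"

def check_char(identifier):
    """Compute a NOID-style check character for 'NAAN/name-without-check'.

    Each character's alphabet index is multiplied by its 1-based position;
    characters outside the alphabet (such as '/') count as zero.
    """
    total = 0
    for position, ch in enumerate(identifier, start=1):
        total += position * max(XDIGITS.find(ch), 0)
    return XDIGITS[total % len(XDIGITS)]

def matches_mask(name, prefix, mask):
    """Check a minted name against a prefix and a mask of d/e/k characters."""
    body = name[len(prefix):]
    if not name.startswith(prefix) or len(body) != len(mask):
        return False
    for ch, kind in zip(body, mask):
        if kind == "d" and not ch.isdigit():
            return False
        if kind in "ek" and ch not in XDIGITS:
            return False
    return True
```

Applied to the slide's first minted id, the final "1" of 13030/f54x54g11 is the check character computed over 13030/f54x54g1.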
Documentation
• ARK specification http://www.ietf.org/internet-drafts/draft-kunze-ark-11.txt
• NOID http://www.cdlib.org/inside/diglib/ark/noid.pdf
Persistent citation next steps
Thinking through how to express when you’re citing a snapshot vs a stage that enacts a thematically consistent thing
Persistent citation for small organizations
  – Their objects are served from at-risk hostnames and servers
  – They need a Name-to-Thing resolver
N2T “Entity” – persistent identifiers for smaller organizations
Establish a consortium and a small web server.
Each member publishes URLs under n2t.info:
http://n2t.info/12345/foo/bar.zaf
…which redirects to the member server URL.
Why? It solves the same persistent identifier problem as URN, DOI, and Handle systems, but more fully, and at lower cost and risk.
Persistent identification
Persistent identifiers? We have them.
• But still need persistent actionable ids
Actionable “with widely available tools”
• Which really means “with URLs”
URNs, ARKs, Handles, DOIs, etc. become actionable (practically speaking) when embedded in URLs
• All these ids have similar maintenance costs, and they all break for the usual reasons
The usual reasons
Whatever the string, what matters is the thing
• If the thing’s unavailable, the id’s broken
Broken, for URLs, means either
• The hostname is broken
  – Server down, gone, or renamed *
  – Domain name lost, provider out of business
• Or the pathname to the thing is broken
  – Thing down, gone, or renamed *
* No global fix for these, only the provider can fix.
Hostname instability
Domain name lost, provider out of business
• We can help this case
Smaller organizations most vulnerable
• The comfort of not seeing your hostname
• The comfort of seeing your hostname
• Traditional solutions: PURL, URN, Handle, DOI
• Solutions tied to special-purpose technology, sometimes complex and proprietary
N2T (Name-to-Thing)
N2T is two things at once
• A consortium of cultural memory organizations … and …
• A small, ordinary web server, mirrored in several instances globally for reliability
Basic idea: protect 200 organizations’ URLs from hostname instability with 200 rewrite rules
How: simple HTTP redirects, one per organization
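
The "one rewrite rule per organization" idea can be sketched as a lookup from a member's number to the base URL of its current server. The table entry and target hostname below are hypothetical, chosen only to mirror the n2t.info/12345 example used in this talk; a real resolver would also carry over any query string and issue an HTTP redirect rather than return a string.

```python
from urllib.parse import urlsplit

# Hypothetical rule table: member number -> current base URL of that member.
REWRITE_RULES = {
    "12345": "http://www.example.org/",
}

def resolve(url):
    """Map a resolver URL to the member's server URL, or None if no rule applies."""
    path = urlsplit(url).path.lstrip("/")
    number, _, rest = path.partition("/")
    base = REWRITE_RULES.get(number)
    return None if base is None else base + rest
```

When a member's hostname changes, only its single table entry needs updating; every published resolver URL keeps working.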
N2T – user point of view
Each consortium member organization gets a unique number, such as, 12345.
N2T – system point of view
Technically, resolution (access to a thing given its name) is two simple steps.
N2T – consortium point of view
“Consortium-lite”
  – Members have no fees or responsibilities
  – One domain name for whole consortium
Rent is $30/year, runs on 4 total web servers
Volunteer member orgs run the servers
  – 1 primary + 3 mirrors
Interested bodies: CENDI, DLF, DCC
Interested institutions: CDL, NYU, NLA, …
N2T – global point of view
Regional (eg, Europe, Asia, North America) clusters of mirrored resolver instances, with round-robin failover for redundancy, fault-tolerance, and load-sharing
• No browser modifications required
[Diagram: user web browser connected to four mirrored resolver instances]
Namespace Splitting Problem
[Diagram: a user’s web browser reaching pages by URL and through a URN/Handle/DOI resolver; organizations A, B, and C hold objects numbered 1-15, reachable at URL’ and URL’’]
Org’n A’s namespace splits when B and C inherit its objects. Under the URN/Handle/DOI model, B must still forward to C.
Solution: add per-object rules
Over years: add per-object redirects as objects go their separate ways
• No server other than N2T has to keep forwarding tables
This requires periodic (e.g., daily) harvesting of local table mappings from well-known provider-side server files, e.g., Google sitemaps
[Diagram: user web browser, global resolver, and final web servers (university library, national library, national archive, ...) supplying harvested updates]
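
One way to picture per-object rules layered over per-organization rules is a two-table lookup in which the more specific entry wins. Everything in this sketch is an assumption for illustration: the function name, the table shapes, and the idea that harvested mappings arrive as exact path-to-URL pairs.

```python
def resolve_path(path, object_rules, org_rules):
    """Return a redirect target, preferring a per-object rule.

    object_rules maps a full resolver path to a final URL (harvested);
    org_rules maps a member number to that member's base URL.
    """
    if path in object_rules:
        # The object has gone its separate way: use its own redirect.
        return object_rules[path]
    number, _, rest = path.partition("/")
    base = org_rules.get(number)
    return None if base is None else base + rest
```

Objects that never moved continue to resolve through the single organization rule, so the per-object table only grows as namespaces actually split.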
Prototype resolver
Sample identifiers at n2t.info – these work now
    http://n2t.info/12345/libraries/visitor.html
    http://n2t.info/13030/inside
    http://n2t.info/urn:nbn:se:uu:diva-3324
    http://n2t.info/ark:/13030/tf5p30086k
Incidentally, it can also redirect all URNs, DOIs, and Handles, e.g.,
    http://n2t.info/doi:10.1111/j.0307-6946.2004.00571.x
Advantages of the n2t.info persistent identifier resolver
• Big reduction in architectural complexity
• No browser modification required
• Identifier scheme-agnostic
• No proprietary, special-purpose infrastructure to carry forward as a liability to persistence
Identifier myths
What’s an identifier as opposed to a mere string of data? Are these identifiers?
    1L039G81U
    Peter
    http://www.cdlib.edu/staff/~greenstein
Identifiers and locations
We want identifiers that are location-independent. Which of these are location-independent?
    http://dot.ucop.edu/home/jak
    http://n2t.info/14998/xt8732r
    http://n2t.info/14998/xt8732r?x=foo&y=bar
Dynamic and static
People complain about dynamically generated pages. Which of these are dynamically generated?
    http://dot.ucop.edu/home/jak
    http://n2t.info/14998/xt8732r
    http://n2t.info/14998/xt8732r?x=foo&y=bar
Identifier basics
An identifier is an association between a string and a thing; you cannot tell by looking at it
  – Strongest association: object wearing its identifier
In the real world, an identifier is also an opinion, and opinions will differ
Identifier persistence is purely a matter of service, not magically conferred by a scheme
  – Identifier form never imparts anything meaningful for persistence concerning location or static-vs-dynamic
Identifiers break due to politics, poorly set user expectations, and low organization awareness
Identifier/citation next steps
Citing a changing object
• NLM permanence ratings
• Change history (back and forth)
• Cheap fixity
Congruence between databases and documents
• Hierarchy, complexity
Annotations are objects too
• Main distinguishing feature being the target object
Units of citation, delivery, addressing, recall