A presentation given at the Digital Curation Centre Joint Workshop on Future-Proofing Institutional Websites, held in London in January 2006. See http://www.dcc.ac.uk/events/fpw-2006/
January 2006
Andy Powell, Eduserv
www.eduserv.org.uk/foundation
Persistently identifying Web site content
Future-proofing Institutional Web sites
DCC and Wellcome Library workshop
January 2006, Future-proofing institutional Web sites
Contents
• context
• functional requirements
• issues raised
• practical suggestions
• note: not going to look at any particular solutions in any detail – PURLs, DOIs, Handles, ARKs, …
Context – institutional Web sites
• institutional Web sites are:
– heterogeneous – i.e. wide variety of content, managed/unmanaged, formal/informal
– primarily accessed via mainstream Web browsers – but that may change over time
– dynamic – i.e. content is regularly added (and changed and removed!)
– closely tied to the institution – and institutions are liable to change!
Context – man vs. machine
• identifiers serve a human and machine/software purpose
– person: “here’s one I found earlier” – e.g. using del.icio.us or connotea
– machine: “is this the same as that?”
• worth remembering that machines tend to be fairly stupid…
– e.g. if some people use the PURL and some use the corresponding URL, then del.icio.us won’t spot that their entries are about the same thing
• in most cases, being able to resolve the identifier is helpful to both people and machines
• in most cases, the longer an identifier lasts, the better – even after the resolution service breaks!
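The del.icio.us point above can be illustrated with a minimal sketch. It assumes a bookmarking service that compares identifiers as plain strings; the `bookmarks` entries, the user names and the `purl_table` mapping are hypothetical, built around the PURL http://purl.org/net/ukoln and its target.

```python
# Two users bookmark the same resource, one via its PURL and one via its URL.
bookmarks = [
    ("alice", "http://purl.org/net/ukoln"),
    ("bob", "http://www.ukoln.ac.uk/"),
]


def group_by_identifier(entries):
    """Group user names by the exact identifier string they bookmarked."""
    groups = {}
    for user, identifier in entries:
        groups.setdefault(identifier, []).append(user)
    return groups


# A naive service sees two unrelated things.
naive = group_by_identifier(bookmarks)

# Only with extra knowledge (a hypothetical PURL -> URL table) can the
# service tell that both entries are about the same thing.
purl_table = {"http://purl.org/net/ukoln": "http://www.ukoln.ac.uk/"}


def canonical(identifier):
    return purl_table.get(identifier, identifier)


informed = group_by_identifier((user, canonical(i)) for user, i in bookmarks)
print(len(naive), len(informed))
```

The string comparison is not wrong as such; it simply has no way of knowing that the two strings name the same resource unless someone supplies that knowledge.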
Context – what is being identified
• the most important question in any discussion about identifiers is “what is being identified?”
• in the case of institutional Web sites…
– the site
– significant parts of the site
– static documents, individual images, etc.
– dynamic services
– …
• some possibility for confusion here
– e.g. what does http://www.bris.ac.uk/ identify?
• but in the case of institutional Web sites, people usually do the ‘right thing’ and what is being identified is obvious from the context…
Context – works vs. manifestations
• one key aspect is whether the identifier is for an abstract ‘work’ or a particular ‘manifestation’ of that work
• there are some scenarios in which it is necessary to identify the ‘work’…
• in other cases, it is necessary to identify a particular ‘manifestation’ of the work
• beginning to see this problem in the development of eprint archives and institutional repositories
“Crystal Studio is a recommended resource for the teaching of crystallography at undergraduate level.”
“To perform this exercise you will need a copy of Crystal Studio version 5.0 (versions 4.0 Lite and 4.0 Professional do not support the required options).”
Functional requirements…
• the JISC IE technical standards document says…
http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/
Every significant item that is made available through a JISC IE network service should be assigned a URI that is reasonably persistent. This means that item URIs should not be expected to break for a period of 10-15 years after they have first been used. For this reason, JISC IE service components should not hardcode file format, server technology, service organisational structure or other information that is likely to change over a 10-15 year period into item URIs. If items become unavailable during that period, then the URI should resolve to a Web page that explains why the item is no longer available and what actions the end-user can take to obtain a copy of the item or similar resources. Furthermore, item URIs should not contain end-user-specific information, i.e. all item URIs should work for all end-users (albeit allowing for appropriate authentication challenges to be inserted into the process by which the URI is resolved).
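The requirement that a withdrawn item's URI should still resolve to an explanatory page can be sketched as follows. This is an illustration only, not part of the JISC IE document: the `LIVE` and `RETIRED` tables, the paths and the wording are hypothetical, and answering with HTTP 410 Gone is one reasonable choice rather than something the standard prescribes.

```python
from http import HTTPStatus

# Hypothetical items that are still available: path -> content.
LIVE = {"/reports/2005/annual": "Annual report 2005"}

# Hypothetical withdrawn items: path -> explanation for the end-user.
RETIRED = {
    "/reports/2001/annual": (
        "This report has been withdrawn. "
        "Contact the library to request an archived copy."
    ),
}


def resolve(path):
    """Return (status, body) for a request path."""
    if path in LIVE:
        return HTTPStatus.OK, LIVE[path]
    if path in RETIRED:
        # The URI still resolves: to a page saying why the item has gone
        # and what the end-user can do next.
        return HTTPStatus.GONE, RETIRED[path]
    return HTTPStatus.NOT_FOUND, "No such item."


status, body = resolve("/reports/2001/annual")
print(int(status), body)
```

Keeping a table of retired paths is the ongoing cost here: the explanation page only exists for as long as someone maintains the entry.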
What should be identified?
• “every significant item”
• what does that mean?
• every resource that people are likely to want to cite persistently?
• there might be stuff on institutional Web sites that we don’t need to cite persistently
– but often difficult to pre-judge what is significant and what isn’t
– and judgements about significance and required level of persistence may come from outside the institution
What does ‘reasonably persistent’ mean?
• notion of ‘persistence’ is application dependent
• perhaps helpful to think about 15 – 20 year timeframe?
– longer than the Web has been around to date
– solutions for 20 year period may well last longer
– ‘forever’ is too long
• what will have changed in 20 years’ time?
– technology – HTML replaced? HTTP replaced? DNS replaced? URI system replaced?
– organisations – mergers, closures, new institutions, new government departments, etc.
– people – deaths, retirements, etc.
– countries!
What does ‘break’ mean?
• what does it mean for an identifier to break?
• need to differentiate between the breakage of services on the identifier and breakage of the identifier itself
• most obvious services on identifiers are ‘resolution services’
– “give me a representation of the identified thing”
– known as ‘dereferencing’ in W3C documentation
• resolution services can break (by design or by accident) but the identifier may live on and remain useful
• the identifier itself only breaks when all parties (including software systems) have forgotten what it identified, or when parties no longer agree about what it identifies (e.g. if it gets re-assigned)
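The difference between a broken resolution service and a broken identifier can be sketched with a toy registry that records what an identifier denotes separately from where it currently resolves. The `IdentifierRegistry` class, the example URI and the resource descriptions are all hypothetical.

```python
class IdentifierRegistry:
    """Toy registry: what an identifier denotes, kept apart from where it resolves."""

    def __init__(self):
        self._denotes = {}    # identifier -> description of the identified thing
        self._locations = {}  # identifier -> current locator, if any

    def assign(self, identifier, thing, location=None):
        # Re-assigning an identifier to a different thing is what really breaks it.
        existing = self._denotes.get(identifier)
        if existing is not None and existing != thing:
            raise ValueError(f"{identifier} already identifies {existing!r}")
        self._denotes[identifier] = thing
        if location is not None:
            self._locations[identifier] = location

    def withdraw_location(self, identifier):
        # The resolution service breaks; the identifier itself does not.
        self._locations.pop(identifier, None)

    def resolve(self, identifier):
        return self._locations.get(identifier)

    def denotes(self, identifier):
        return self._denotes.get(identifier)


uri = "http://www.somewhere.ac.uk/prospectus"
registry = IdentifierRegistry()
registry.assign(uri, "undergraduate prospectus", location=uri)
registry.withdraw_location(uri)
print(registry.resolve(uri))   # resolution no longer works
print(registry.denotes(uri))   # but what the identifier names is still known
```

In this sketch the only operation that is refused outright is re-assignment, which mirrors the point that an identifier breaks when parties stop agreeing about what it identifies.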
Usability issues
• “the only good long-term identifier is a good short-term identifier”
• unless identifiers work well now, then they won’t turn into persistent identifiers because they won’t be used at all
• what does “work well” mean (particularly in the context of institutional Web sites)?
– conformant with current Internet standards
– usable in Web browsers (without additional plug-ins - i.e. usable by everyone)
– meaningful to people
– resolvable
– simple to assign and maintain
– low cost (in terms of money and time)
Interim conclusions…
• identifiers for content on institutional Web sites should be URIs
– why? because the URI is the global and unambiguous standard for identifiers on the Internet
• ‘http’ URIs are better than any other form of URI
– why? because they work in current Internet tools, particularly Web browsers
– built-in resolution mechanism
– easy to assign and low-cost (typically!)
‘http’ URI problems?
• but ‘http’ URIs tend to break, don’t they?
– note: usually it is the resolution service that breaks (i.e. they stop working as locators) – this doesn’t necessarily imply that they stop functioning as identifiers, though the two may be closely related
• reasons for fragility of ‘http’ URI resolution examined later
• but ‘poor design’ and lack of commitment often to blame
• not necessarily the case that one can apply generic Internet-wide findings about ‘http’ URI breakage to ‘institutional’ Web sites
• attempts at more persistent forms of identifier often based on moving away from direct ties to HTTP and/or introducing a level of indirection
How indirection works (or not?)
• populate resolution service tables with identifier -> locator mappings (and possibly other metadata)
– DOI: 10.1000/182 -> http://www.doi.org/hb.html
– Handle: 4263537/4002 -> http://www.handle.net/documentation.html
– ARK: http://ark.nlm.nih.gov/ark:/12025/pm10611131 -> http://brain.oxfordjournals.org/cgi/content/full/123/1/171
– PURL: http://purl.org/net/ukoln -> http://www.ukoln.ac.uk/
• typically used as the basis for HTTP redirects, e.g.
– http://dx.doi.org/10.1000/182 -> http://www.doi.org/hb.html
– http://hdl.handle.net/4263537/4002 -> http://www.handle.net/documentation.html
– etc.
• helps to ensure persistence… but
– HTTP redirects not handled very well by browsers – end-user is typically left using the non-persistent URI
– need commitment to maintain resolver services and tables
– introduces a second (at least) identifier for each resource
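The table-plus-redirect mechanism described on this slide can be sketched in a few lines. The mappings are the DOI and Handle examples given above, placed in a single table purely for illustration; the `redirect_for` function is hypothetical, and the choice of a 302 status is an assumption, since real resolvers may answer with a different 3xx code.

```python
# Identifier -> locator mappings, taken from the examples on this slide.
RESOLVER_TABLE = {
    "10.1000/182": "http://www.doi.org/hb.html",
    "4263537/4002": "http://www.handle.net/documentation.html",
}


def redirect_for(identifier):
    """Return the (status, headers) a resolver might send for an identifier."""
    target = RESOLVER_TABLE.get(identifier)
    if target is None:
        return 404, {}
    # 302 is an assumption here; a real resolver may use another 3xx code.
    return 302, {"Location": target}


print(redirect_for("10.1000/182"))
print(redirect_for("no/such-identifier"))
```

When a resource moves, only its table entry needs to change, which is where the persistence comes from. It is also where the cost lies: the identifier works only for as long as the table and the service in front of it are maintained.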
What about uniqueness?
• the same identifier should not be assigned to more than one resource
• a resource may have more than one identifier assigned to it… but this should be avoided as far as possible
– e.g. the DOI “10.1000/182” can be encoded as a URI in several ways:
– http://dx.doi.org/10.1000/182, doi:10.1000/182, urn:doi:10.1000/182 and info:doi/10.1000/182
– therefore, DOI-aware applications need to have knowledge of these encodings hard-coded into them (partly because the DOI itself is just a string, but also because nothing in the URI specification indicates that the URI encodings are equivalent)
– though within a domain this may become the norm (e.g. Google Scholar, Crossref, Connotea, etc.)
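The hard-coded knowledge that a DOI-aware application needs can be sketched as a prefix table. This is a minimal illustration using only the four encodings listed above; the `bare_doi` helper is hypothetical and does no validation of the DOI itself.

```python
# The four URI encodings of a DOI listed on this slide, as prefixes.
PREFIXES = (
    "http://dx.doi.org/",
    "urn:doi:",
    "info:doi/",
    "doi:",
)


def bare_doi(uri):
    """Strip a known encoding prefix, returning the DOI string or None."""
    for prefix in PREFIXES:
        if uri.startswith(prefix):
            return uri[len(prefix):]
    return None


encodings = [
    "http://dx.doi.org/10.1000/182",
    "doi:10.1000/182",
    "urn:doi:10.1000/182",
    "info:doi/10.1000/182",
]
print({bare_doi(uri) for uri in encodings})
```

Nothing in the URIs themselves says they are equivalent: the equivalence lives entirely in the `PREFIXES` tuple, and an application that has never heard of one of these encodings will treat it as a different identifier.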
ARK system
• ARKs are worthy of note since they are ‘http’ URIs
– and therefore meet many of the usability requirements outlined earlier
• ARKs clearly flag an institutional commitment to persistence
– the identifier owner (often the resource owner) commits to maintaining ARK services and associated metadata
– no reliance on third-party resolver
• but they suffer from the HTTP redirect problem
• and ultimately may lead to multiple URIs being assigned to a single resource
Anatomy of ‘http’ URIs
http://www.somewhere.ac.uk/physics/index.cfm?name=about
http://www.somewhere.ac.uk/chemistry/report.rtf
• ‘http’ URI scheme (http:) – URI persistence not reliant on HTTP protocol, but is reliant on continued registration and management of the scheme (and of the URI spec. itself!)
• DNS domain name (www.somewhere.ac.uk) – persistence reliant on continued ownership and management of the DNS domain name (and the DNS!)
• Component hierarchy, often organisationally based (/physics/, /chemistry/) – persistence reliant on continued management of component structure, i.e. not re-using old components
• Server technology (index.cfm) – change of technology may force a change of URI, leading to multiple URIs for same resource (with no simple mechanism for determining equivalence)
• File format (report.rtf) – inappropriate if identifier is for the ‘work’ rather than the ‘manifestation’, because changing the format will result in a new URI
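The components discussed on this slide can be pulled apart with Python's standard library, as a minimal sketch. The `anatomy` function and the names of its dictionary keys are hypothetical; the two URIs are the examples shown above.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def anatomy(uri):
    """Split an 'http' URI into the parts whose persistence is discussed above."""
    parts = urlsplit(uri)
    path = PurePosixPath(parts.path)
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        "hierarchy": list(path.parent.parts[1:]),  # drop the leading "/"
        "suffix": path.suffix,                     # hints at technology or format
        "query": parse_qs(parts.query),
    }


for uri in (
    "http://www.somewhere.ac.uk/physics/index.cfm?name=about",
    "http://www.somewhere.ac.uk/chemistry/report.rtf",
):
    print(anatomy(uri))
```

Each key in the result corresponds to something that has to stay stable, or be deliberately managed, if the URI is to keep working.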
Improving persistence of ‘http’ URIs
• choose long-lived DNS domain names – e.g. try to avoid details of internal organisational structure
• partition URI components by ‘function’ rather than by organisational structure - because structure is likely to change
• avoid exposing Web server technology in URIs (Cold Fusion, PHP, etc.) - to allow changes to technology without URI proliferation and resolver breakage
• avoid embedding details of document format into URIs, unless particular manifestation is being identified
• avoid embedding end-user or session information into URIs – so that they can be shared between people
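Some of these recommendations can be checked mechanically. The sketch below flags URIs that expose server technology, document format or session information; the suffix and parameter lists are illustrative assumptions rather than a complete inventory, and the `persistence_warnings` function is hypothetical. It cannot judge the first two recommendations (domain names and partitioning by function), which need human judgement.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit

# Illustrative, non-exhaustive lists; adjust to local practice.
TECHNOLOGY_SUFFIXES = {".cfm", ".php", ".asp", ".jsp"}
FORMAT_SUFFIXES = {".rtf", ".doc", ".pdf", ".html"}
SESSION_PARAMS = {"sessionid", "sid", "user"}


def persistence_warnings(uri):
    """Return a list of reasons why a URI may not persist well."""
    parts = urlsplit(uri)
    suffix = PurePosixPath(parts.path).suffix.lower()
    warnings = []
    if suffix in TECHNOLOGY_SUFFIXES:
        warnings.append(f"exposes server technology ({suffix})")
    if suffix in FORMAT_SUFFIXES:
        warnings.append(f"embeds document format ({suffix})")
    leaked = SESSION_PARAMS & {key.lower() for key in parse_qs(parts.query)}
    if leaked:
        warnings.append("carries end-user or session information")
    return warnings


for uri in (
    "http://www.somewhere.ac.uk/physics/index.cfm?name=about",
    "http://www.somewhere.ac.uk/chemistry/report.rtf",
    "http://www.somewhere.ac.uk/prospectus",
):
    print(uri, persistence_warnings(uri))
```

A format suffix is only a problem when the URI is meant to identify the ‘work’; where a particular manifestation is being identified, that warning can reasonably be ignored.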
Conclusions and recommendations
• persistent identifiers require persistent commitment from the institution (and third parties)
• need to determine what ‘persistent’ means in practice (on the basis that ‘forever’ is unrealistic)
• ‘http’ URIs can be made more persistent if they are constructed and managed sensibly
• use of DOIs/Handles/ARKs/PURLs may be appropriate (particularly where domain practice is clear)
– but need to be clear about cost/benefits and institutional and third-party commitment to maintaining resolver tables and associated services
– where these are used, always and only use the ‘http’ form of URI (e.g. http://dx.doi.org/10.1000/182)
Questions…