Prototypes of pro-active approaches to support the archiving of web references for scholarly
communications
Richard Wincewicz (1), Peter Burnhill (1) & Herbert Van de Sompel (2)
(1) EDINA, University of Edinburgh; (2) Los Alamos National Laboratory
The Project Team 2013 – 2015, funded by the
Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Herbert Van de Sompel, Harihar Shankar, [Martin Klein, Rob Sanderson]
• University of Edinburgh:
Language Technology Group: Claire Grover, Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]
EDINA*: Peter Burnhill, Muriel Mewissen (Project Manager), Tim Stickland, Richard Wincewicz, [Neil Mayo]
* Centre for Service Delivery & Digital Expertise
Reference Rot
Links to Web at Large resources are subject to Reference Rot. This is a combination of two factors:
• Link Rot: Link stops working
  • e.g. HTTP 404 “Not Found”
• Content Drift: Linked content changes over time
  • Possibly to the extent that it is no longer representative of the content that was initially referenced
Articles that Link to Articles & to Web At Large Resources (PMC)
Martin Klein et al. (2014) Scholarly context not found. In: PLOS ONE. http://dx.doi.org/10.1371/journal.pone.0115253
Articles that Link to Articles & to Web At Large Resources (Elsevier)
Articles with URI References (PMC)
Articles 479,194
with URI references 399,005
with URI references to articles 240,857
with URI references to Web at Large 156,160
Link Rot (PMC)
Link Rot (Elsevier)
Links from arXiv, Elsevier, PMC to TLD Targets
Grey is Link Rot – Referenced Content Not Accessible
Grey is Not Archived - Referenced Content Lost
Content Drift – http://dl00.org
Years shown: 2000, 2004, 2005, 2008
(a) Dynamic content: values on webpage change over time
(b) Static content, but very different (often unrelated) web pages
Create Snapshots of Referenced Resources
Various web archives support on-demand creation of snapshots of URIs (manual, API):
• archive.today
• Internet Archive
• perma.cc
• webcitation.org
When creating snapshots, maintain:
• Original URI
• Snapshot URI
• Date/Time of snapshot
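The three items to maintain can be kept together as one record per reference. The following is a minimal sketch, not any archive's actual API: `SnapshotRecord`, `create_snapshot` and `fake_archive` are hypothetical names, and the archiving call is passed in as a function so that any on-demand archive could be plugged in.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class SnapshotRecord:
    """The three items to keep for every snapshot (hypothetical structure)."""
    original_uri: str
    snapshot_uri: str
    snapshot_datetime: datetime


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


def create_snapshot(original_uri: str,
                    archive_fn: Callable[[str], str],
                    clock: Callable[[], datetime] = utc_now) -> SnapshotRecord:
    """Ask an archive for a snapshot and record what is needed to cite it.

    archive_fn is an assumption: any callable that takes the original URI
    and returns the URI of the snapshot that was created.
    """
    snapshot_uri = archive_fn(original_uri)
    return SnapshotRecord(original_uri, snapshot_uri, clock())


def fake_archive(uri: str) -> str:
    # Stand-in for a real archive client; returns a made-up snapshot URI.
    return "https://archive.example/snapshot/1"


record = create_snapshot("http://www.stanford.edu", fake_archive)
print(record.original_uri, record.snapshot_uri, record.snapshot_datetime.isoformat())
```

Passing the archive call in as a parameter keeps the record-keeping independent of whichever archive is used.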
Create Snapshots of Referenced Resources
Snapshots can be created at various stages. The closer to the moment of referencing, the better the image captured.
Stage              Actor                            Snapshot Quality
Preparation        Author/reference tool            best
Submission/Issue   Editor/manuscript system         good
Publication        Aggregator/publisher platform    ok
Post-publication   Librarian/IR, journal archive    better than nothing
Authoring - Zotero Plugin Demonstrator
Richard Wincewicz (2014) Prototype Hiberlink plugin for Zotero for pro-active archiving and temporal references
https://www.youtube.com/v/ZYmi_Ydr65M%26vq
Publication - HiberActive Service Demonstrator
Martin Klein et al. (2014) HiberActive: Pro-Active Archiving of web references from scholarly articles
Open Repositories 2014 http://www.slideshare.net/martinklein0815/hiberactive
Reference Resources Robustly
When referencing resources include:
Original URI – Allows the user to revisit the URI as it is at the time of reading, if the URI is still operational
Snapshot URI – Allows the user to visit the snapshot, if one was created, and if the web archive in which it was created is still operational
Date/Time – Together with the original URI, allows the user to visit any snapshot created around the Date/Time in any web archive around the world (using Memento infrastructure)
(2015) Robust Links - Motivation. http://robustlinks.mementoweb.org/about/
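How the original URI and the Date/Time combine into a Memento lookup can be sketched as follows. In the Memento protocol (RFC 7089) a client sends the desired date to a TimeGate in an `Accept-Datetime` header, formatted as an HTTP date. The function name and the TimeGate base URI are assumptions for illustration; the sketch only builds the request and does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: base URI of a Memento TimeGate aggregator. Substitute the
# TimeGate of whichever Memento service is actually used.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_uri: str, reference_date: str):
    """Return (timegate_uri, headers) for a Memento datetime negotiation.

    reference_date is a YYYY-MM-DD string, as in the link examples; it is
    interpreted as midnight UTC on that day.
    """
    moment = datetime.strptime(reference_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    headers = {"Accept-Datetime": format_datetime(moment, usegmt=True)}
    return TIMEGATE_BASE + original_uri, headers


uri, headers = build_timegate_request("http://www.stanford.edu", "2014-08-15")
print(uri)
print(headers)
```

The returned pair can be handed to any HTTP client; a TimeGate normally answers with a redirect to a snapshot near the requested date.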
Reference Resources Actionably
When referencing resources, use Link Decorations to convey Original URI, Snapshot URI, Date/Time
<a href="http://www.stanford.edu" data-versionurl="http://archive.is/FAy6o" data-versiondate="2014-08-15">
<a href="http://www.stanford.edu" data-versiondate="2014-08-15">
<a href="http://archive.is/FAy6o" data-originalurl="http://www.stanford.edu" data-versiondate="2014-08-15">
Herbert Van de Sompel et al. (2015) Robust Links - Link Decorations. http://robustlinks.mementoweb.org/spec/
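The three decoration patterns can be generated from the same three pieces of information. This is a small sketch under stated assumptions: the function `robust_link` and its parameters are hypothetical, and only the attribute names (`data-versionurl`, `data-versiondate`, `data-originalurl`) come from the examples above.

```python
from html import escape
from typing import Optional


def robust_link(text: str,
                original_uri: str,
                version_date: str,
                snapshot_uri: Optional[str] = None,
                link_to_snapshot: bool = False) -> str:
    """Render an <a> element carrying Link Decorations.

    By default href points at the original URI; with link_to_snapshot=True
    href points at the snapshot and the original goes in data-originalurl.
    """
    if link_to_snapshot:
        if snapshot_uri is None:
            raise ValueError("link_to_snapshot requires a snapshot_uri")
        attrs = [("href", snapshot_uri), ("data-originalurl", original_uri)]
    else:
        attrs = [("href", original_uri)]
        if snapshot_uri is not None:
            attrs.append(("data-versionurl", snapshot_uri))
    attrs.append(("data-versiondate", version_date))
    rendered = " ".join(f'{name}="{escape(value, quote=True)}"' for name, value in attrs)
    return f"<a {rendered}>{escape(text)}</a>"


print(robust_link("Stanford", "http://www.stanford.edu", "2014-08-15",
                  snapshot_uri="http://archive.is/FAy6o"))
print(robust_link("Stanford", "http://www.stanford.edu", "2014-08-15"))
print(robust_link("Stanford", "http://www.stanford.edu", "2014-08-15",
                  snapshot_uri="http://archive.is/FAy6o", link_to_snapshot=True))
```

The three calls reproduce the three patterns shown above, with link text and a closing tag added.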
Robust Links Using Link Decorations, JavaScript, Memento API
Demo - http://robustlinks.mementoweb.org/demo/uri_references_js.html
robustlinks.js - https://github.com/mementoweb/robustlinks
Activate Robust Links
There are currently no Link Decorations, but there is an article publication date:
Express the article publication date in an actionable manner (‘datePublished’ or ‘dateModified’ Schema.org properties) in HTML pages that contain URI references
Tailor robustlinks.js to exclude links to articles
Inject robustlinks.js in HTML pages that contain URI references
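What a script needs to find in such a page can be illustrated with a short sketch. It is not robustlinks.js itself: it assumes the publication date is exposed as `<meta itemprop="datePublished" content="...">`, and it treats links to a small, hypothetical set of hosts as links to articles to be excluded.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: hosts whose links count as "links to articles" and are skipped.
ARTICLE_HOSTS = {"doi.org", "dx.doi.org"}


class ReferenceScanner(HTMLParser):
    """Collect the publication date and the links to the web at large."""

    def __init__(self):
        super().__init__()
        self.date_published = None
        self.web_at_large_links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("itemprop") == "datePublished":
            self.date_published = attributes.get("content")
        elif tag == "a":
            href = attributes.get("href")
            host = urlparse(href).netloc.lower() if href else ""
            if host and host not in ARTICLE_HOSTS:
                self.web_at_large_links.append(href)


def scan(html_text: str):
    """Return (date_published, links that should become robust links)."""
    scanner = ReferenceScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.date_published, scanner.web_at_large_links


page = (
    '<html><head><meta itemprop="datePublished" content="2014-08-15"></head>'
    '<body><a href="http://dx.doi.org/10.1371/journal.pone.0115253">article</a>'
    '<a href="http://www.stanford.edu">web page</a></body></html>'
)
print(scan(page))
```

Each collected link, paired with the publication date, has what is needed for a date-based lookup in web archives.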
Users Follow Robust Links into Web Archives
The combination of the referenced URI and the article publication date:
Leads users to a snapshot in a web archive, created as close as possible to the article publication date
Addresses link rot
Addresses content drift
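The selection of a snapshot "as close as possible" to a date is normally done by the Memento infrastructure rather than by the publisher. Purely as an illustration of the idea, the sketch below picks the nearest snapshot from a list; the function name and the example snapshot URIs are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def closest_snapshot(snapshots: List[Tuple[str, datetime]],
                     reference_date: datetime) -> Optional[str]:
    """Return the URI of the snapshot nearest in time to reference_date.

    snapshots is a list of (snapshot_uri, snapshot_datetime) pairs.
    Returns None if no snapshot exists (the reference is then lost).
    """
    if not snapshots:
        return None
    uri, _ = min(snapshots, key=lambda item: abs(item[1] - reference_date))
    return uri


candidates = [
    ("https://archive.example/2012/page", datetime(2012, 3, 1)),
    ("https://archive.example/2014/page", datetime(2014, 9, 1)),
    ("https://archive.example/2016/page", datetime(2016, 1, 1)),
]
print(closest_snapshot(candidates, datetime(2014, 8, 15)))
```

A snapshot close to the publication date addresses both problems at once: it still exists when the live link has rotted, and it shows the content before it drifted.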
Create Archive Copies
When ingesting new content into the platform:
Parse for URI references
Create snapshots in web archives of select URIs
For these URIs, use Link Decorations in HTML to convey:
• original URI
• snapshot URI
• snapshot Date/Time
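The last ingest step, adding Link Decorations for the URIs that were snapshotted, could look roughly like this. It is a simplified sketch: the `decorate` function is hypothetical, and it assumes anchors of the plain form `<a href="...">` with double-quoted values. A production platform would more likely use a full HTML parser.

```python
import re
from html import escape
from typing import Dict, Tuple

# Assumption: anchors carry only a double-quoted href attribute.
ANCHOR = re.compile(r'<a\s+href="([^"]+)"\s*>')


def decorate(html_text: str, snapshots: Dict[str, Tuple[str, str]]) -> str:
    """Add Link Decorations to anchors whose URI has a snapshot.

    snapshots maps an original URI to (snapshot URI, snapshot date).
    Anchors for URIs without a snapshot are left unchanged.
    """
    def add_decorations(match):
        uri = match.group(1)
        if uri not in snapshots:
            return match.group(0)
        snapshot_uri, snapshot_date = snapshots[uri]
        return (f'<a href="{uri}" '
                f'data-versionurl="{escape(snapshot_uri, quote=True)}" '
                f'data-versiondate="{escape(snapshot_date, quote=True)}">')

    return ANCHOR.sub(add_decorations, html_text)


page = ('<p><a href="http://www.stanford.edu">Stanford</a> and '
        '<a href="http://example.org/">other</a></p>')
known = {"http://www.stanford.edu": ("http://archive.is/FAy6o", "2014-08-15")}
print(decorate(page, known))
```

Only the select URIs present in the mapping are decorated, which matches the idea of snapshotting a chosen subset of references.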
Users Follow Robust Links into Web Archives
The Link Decorations:
Lead users to the created snapshot, if the web archive is operational
Lead users to a snapshot in any web archive, created as close as possible to the snapshot Date/Time
Addresses link rot
Addresses content drift