Preserving the scholarly record with WebCite: an archiving system for long-term digital preservation of cited webpages

Gunther Eysenbach MD MPH

Editor/Publisher, J Med Internet Res

Associate Professor, Department of Health Policy, Management and Evaluation, & KMDI, University of Toronto

Senior Scientist, Centre for Global eHealth Innovation, Division of Medical Decision Making and Health Care Research; Toronto General Research Institute of the UHN, Toronto General Hospital, Canada

WebCite® (www.webcitation.org)

Mission

WebCite® is an on-demand archiving system (controlled by citing and cited authors, editors, and publishers), which enables long-term digital preservation and citability of any kind of Internet-accessible object*

* webpages, blogs, wikis, data files (e.g. spreadsheets), PDF reports, “grey” research reports, preprints etc.

E-publishing & Open Access Research Group at the CGEI, Toronto

• Journal of Medical Internet Research (www.jmir.org)
– Living publishing lab
– A pioneer in Open Access publishing (10 yrs)
– Leading journal in its discipline (Impact Factor 3.0)
– “triple-O” philosophy (open access, open source, open peer-review)
– OS contributions include contributions to OJS and XML-typesetting software (originally © MJ Suhonos, G. Eysenbach, J Alperin; code released under GNU forms basis for PKP Lemon8 project)

• CIHR-funded research on the Impact of Open Access on Knowledge Translation (see e.g. Eysenbach. PLoS Biol 4(5): e157)

• Publishing innovations incl. WebCite® (www.webcitation.org)

www.jmir.org

Authors increasingly cite non-traditional (web)references

• Webpages (e.g. personal homepages)

• “grey” PDF reports (e.g. research progress reports, etc.)

• Blogs

• Wikis

• Datasets which are available online

Note: For the purpose of this talk I refer to “webpages” or webreferences - but what I really mean is any sort of electronic digital object that can be cited and which can be deemed non-traditional (not having a DOI)

Problem 1: URLs go “dead”

[Figure: Attrition rate of cited non-journal URLs. x-axis: years (0 to 12); y-axis: % URLs still working (0 to 100)]

Dellavalle RP, Hester EJ, Heilig LF, Drake AL, Kuntzman JW, Graber M, et al. Information science. Going, going, gone: lost Internet references. Science 2003 Oct 31;302(5646):787-788. DOI:10.1126/science.1088234

In one study published in the journal Science, 13% of Internet references in scholarly articles were inactive after only 27 months.

http://dx.doi.org/10.1126/science.1088234

Problem 2: Even if URLs don’t go “dead”, their content may change

Eysenbach G. Towards quality management of medical information on the internet: evaluation, labelling, and filtering of information. BMJ 1998;317:1496-1502

Today, that site looks different…

medpics.org

Wikis and Blogs change constantly

The homepage of a blog shows the most recent posts only

Problem 3: Internet material not deemed “citable”

(impedes the use of blogs, wikis, online-sharing of datasets etc.)

Editors often discourage citing web material (including datasets)

URL:http://www.plantphysiol.org/misc/ifora.shtml. Accessed: 2008-06-26. (Archived by WebCite® at http://www.webcitation.org/5YsaBISU5)

Internet material not considered citable (deemed unstable, not archived)

Fear of plagiarism / not getting credit

Authors are reluctant to:
– Make data and datasets accessible online
– Participate in collaborative projects (wikis)
– Share information in blogs

Problem 4: Crawler-based archiving insufficient

Limitations of crawler-based archiving

• No author-initiated, on-demand archiving on a given date/time

• “Shotgun” approach

• Crawler cannot go everywhere (“hidden web”)

• No impact statistics (how often has my archived copy been retrieved)

• Impossible to curate

WebCite = Web Archiving 2.0

The solution: WebCite®

• First mentioned as an idea and implemented as a prototype in 1998 (Eysenbach, BMJ 1998;317:1496-1502)

• Project idea revived in 2004/2005
• First implemented by J Med Internet Res
• Today, used by >200 journals and large publishers (including Biomed Central, Oxford University Press)

• Became member of the International Internet Preservation Consortium in 2008

[Diagram: WebCite® architecture]
• Actors: Citing Author (/archive, /bookmarklet, /comb), Cited Author (/archive, self-archiving), Publisher/Editor (/archive, /comb), Reader
• Archiving modes: reverse (citation-triggered) archiving, self (author-triggered) archiving, third-party archiving
• Components: WebCite®, DOI® server, link resolver, CrossRef® forward linking XML, XML manuscript with DOI®, (optional) DOI assignment
• Mirrors: IA, libraries / digital preservation partners
• Reader sends a snapshot retrieval request (DOI with hash)
• Sample citing paper: “What the world needs” by J. Author, with reference 1 (Doe J., www.citedwebsite.com/exmpl [Accessed 1.1.2004])
• Other labels: dynamic content, static content

[Diagram: third-party archiving path. Labels: Citing Author (/archive, /bookmarklet, /comb), WebCite® and mirrors, Reader (snapshot retrieval request)]

Two possible citation formats to cite the WebCite snapshot

Opaque (ID-based):
Eysenbach, Gunther. Gunther Eysenbach Random Research Rants Blog. 2008-06-26. URL:http://gunther-eysenbach.blogspot.com. Accessed: 2008-06-26. (Archived by WebCite® at http://www.webcitation.org/5YreMGRz7)

Transparent:
Eysenbach, Gunther. Gunther Eysenbach Random Research Rants Blog. 2008-06-26. http://www.webcitation.org/query?url=http%3A%2F%2Fgunther-eysenbach.blogspot.com&date=2008-06-26

(Note that there are also others: Hash-based, and citing-document-DOI-based)
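The two link shapes shown on this slide can be assembled mechanically. The following is a minimal Python sketch, not WebCite's own code: the function names are hypothetical, and only the URL patterns themselves are taken from the examples above.

```python
from datetime import date
from urllib.parse import urlencode

WEBCITE_BASE = "http://www.webcitation.org"


def opaque_citation_url(snapshot_id: str) -> str:
    """Opaque (ID-based) link: the snapshot ID is appended to the base URL."""
    return f"{WEBCITE_BASE}/{snapshot_id}"


def transparent_citation_url(cited_url: str, cited_on: date) -> str:
    """Transparent link: the cited URL and the citation date are visible in the query string."""
    # urlencode percent-encodes the nested URL, as in the slide's example.
    query = urlencode({"url": cited_url, "date": cited_on.isoformat()})
    return f"{WEBCITE_BASE}/query?{query}"


print(opaque_citation_url("5YreMGRz7"))
print(transparent_citation_url("http://gunther-eysenbach.blogspot.com", date(2008, 6, 26)))
```

The transparent form keeps the original address readable inside the citation, while the opaque form is shorter and easier to print.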



Reader point of view: for retrieving archived material the reader simply clicks on the WebCite link

Cited reference in the sample paper: www.webcitation.org?cache_url=www.citedwebsite.com/exmpl&cache_date=31.1.2003 [Accessed 31.1.2004]

1. Reader clicks on cited webcitation-URL (on 1.1.2005)
2. Request is redirected to webcitation.org
3. WebCite attempts to retrieve the “live” cited URL (www.citedwebsite.com/exmpl); if it is not found (ERROR: NOT FOUND), it displays the cached version (and/or other versions)
4. Displays cached version (timestamp 31.1.2004)
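The retrieval steps on this slide amount to a "live first, snapshot as fallback" lookup. The sketch below illustrates that logic only; `resolve_citation` and the two fetcher callables are hypothetical names, and the dictionaries stand in for the live web and the archive.

```python
from typing import Callable, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def resolve_citation(cited_url: str, fetch_live: Fetcher, fetch_snapshot: Fetcher) -> Tuple[str, str]:
    """Return (source, content): the live page if it is reachable, else the archived snapshot."""
    live = fetch_live(cited_url)
    if live is not None:
        return ("live", live)
    cached = fetch_snapshot(cited_url)
    if cached is not None:
        return ("cached", cached)
    raise LookupError(f"no live page or snapshot for {cited_url}")


# Stand-ins for illustration: the cited page has disappeared, but a snapshot exists.
live_web = {}
archive = {"www.citedwebsite.com/exmpl": "<html>archived copy</html>"}

source, body = resolve_citation("www.citedwebsite.com/exmpl", live_web.get, archive.get)
print(source)  # prints "cached"
```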

Bookmarklet

Can be used to rapidly archive the currently viewed webpage

(bookmarklet hands over current URL and email address of the citing author to the WebCite server)
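What the bookmarklet hands over can be pictured as a single request URL. This is a sketch under stated assumptions: the "/archive" path comes from the diagram labels in this talk, but the parameter names `url` and `email` and the use of a GET query string are assumptions made for illustration, not a documented interface.

```python
from urllib.parse import urlencode

# Path taken from the "/archive" label on the slides; the query format is assumed.
ARCHIVE_ENDPOINT = "http://www.webcitation.org/archive"


def archive_request_url(page_url: str, author_email: str) -> str:
    """Build the request a bookmarklet could send: current page URL plus the citing author's email."""
    # Parameter names "url" and "email" are hypothetical.
    return f"{ARCHIVE_ENDPOINT}?{urlencode({'url': page_url, 'email': author_email})}"


print(archive_request_url("http://www.example.org/report.pdf", "author@example.org"))
```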

[Diagram: self-archiving path. Labels: Citing Author (/comb), WebCite® (self-archiving) and mirrors, Reader (snapshot retrieval request), dynamic content, static content. Sample citing paper: “What the world needs” by J. Author]

As “potentially cited” author I can self-archive and add a static WebCite-enriched reference as citation suggestion…

… or I provide a dynamic link to the WebCite archiving form (“WebCite this!”)

Clicking “WebCite this!” populates the archiving form with metadata from the cited author

(the same approach can be used by authors of wikis, datasets etc.)

Implementation from a publisher / editor point of view

Level 1-4 implementation (ordered by time since author saw the cited webdocument)

1. Author “webcites” document immediately (or reference manager takes care of this). Editors stipulate this in their Instructions for authors
2. Editor/Copyeditor “webcites” cited document before publication
3. WebCite® immediately archives cited webreferences on publication (combing XML files)
4. Retrospective focussed crawling of old articles
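"Combing" an XML file for cited webreferences can be illustrated with a short Python sketch. It assumes an NLM-style reference list like the example in the appendix of this talk; the sample document, the `ref-list` wrapper, and the function name are made up for illustration, and this is not WebCite's actual comber.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Illustrative sample: one web citation and one journal citation.
SAMPLE = """<ref-list xmlns:xlink="http://www.w3.org/1999/xlink">
  <ref id="ref19"><nlm-citation citation-type="web">
    <article-title>Who Gets ALS</article-title>
    <comment><ext-link ext-link-type="uri" xlink:href="http://www.alsa.org/als/who.cfm">http://www.alsa.org/als/who.cfm</ext-link></comment>
  </nlm-citation></ref>
  <ref id="ref20"><nlm-citation citation-type="journal">
    <article-title>A journal article</article-title>
  </nlm-citation></ref>
</ref-list>"""


def comb_web_references(xml_text):
    """Return (reference id, cited URL) pairs for every web citation in the XML."""
    root = ET.fromstring(xml_text)
    found = []
    for ref in root.iter("ref"):
        for citation in ref.iter("nlm-citation"):
            if citation.get("citation-type") != "web":
                continue  # journal references are not archived
            for link in citation.iter("ext-link"):
                href = link.get(XLINK_HREF)
                if href:
                    found.append((ref.get("id"), href))
    return found


print(comb_web_references(SAMPLE))
```

Each URL found this way could then be submitted for archiving at publication time.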

Level 1-Implementation by journal editors: Instructions for authors

[Diagram: Level 1 implementation. Labels: Citing Author (/archive, /bookmarklet, /comb), Publisher/Editor (/archive, /comb), WebCite® and mirrors]

Implemented by >200 journals

What’s next

Future developments

WebCite 2.0

• User accounts
• Enables users to view a list of the snapshots they created (and to categorize and export them, e.g. in BibTeX, RefMan etc.)
• Enables tagging, “crowdsourcing” of curation tasks such as metadata entering & reconciliation
• Recommender service (people who cited x also cited y)
• Post-publication peer-review (others can rate documents)
• For cited authors
– WebCite® Impact Factor (access / citation statistics, which can be used for tenure & promotion purposes)
– WebCitation-Alert service

Implementation of WebCite® in tools facilitating “archive as you cite”

• Bibliographic management systems (EndNote, Reference Manager) and shared bookmarks (Connotea, CiteULike)

• XML-editing software (Word 2007 XML-addin, Lemon8 etc.)

• Plugin for OJS and other manuscript management systems (allowing authors to automatically WebCite all references in their manuscript)

WebCite® works within the International Internet Preservation Consortium (IIPC)

• To collect and preserve a rich body of Internet content from around the world

• To foster the development and use of common tools, techniques and standards that enable the creation of international archives

• To encourage and support national libraries everywhere to address Internet collecting and preservation

http://netpreserve.org

2008 IIPC Members (38)

• Asia
– Jewish National and University Library (Israel)
– National Diet Library, Japan
– National Library Board, Singapore
– National Library of China

• Europe
– Biblioteca de Catalunya (Library of Catalonia)
– Biblioteca Nazionale Centrale di Firenze (National Library of Italy, Florence)
– Biblioteka Narodowa (National Library of Poland)
– Bibliotheque nationale de France (National Library of France)
– British Library (U.K.)
– Deutsche Nationalbibliothek (German National Library)
– European Archive Foundation
– Hanzo Archives Ltd. (U.K.)
– Kansalliskirjasto (National Library of Finland)
– Koninklijke Bibliotheek (National Library of the Netherlands)
– Kungl. biblioteket (National Library of Sweden)
– Landsbokasafn Islands – Haskolabokasafn (National and University Library of Iceland)
– Latvijas Nacionālā bibliotēka (National Library of Latvia)
– Nacionalna i sveučilišna knjižnica u Zagrebu (National and University Library in Zagreb, Croatia)
– Narodna in univerzitetna knjižnica (National and University Library, Slovenia)
– Národní knihovna České republiky (National Library of the Czech Republic)
– Nasjonalbiblioteket (National Library of Norway)
– National Archives (U.K.)
– National Library of Scotland
– Netarchive.dk (Royal Library and the State and University Library, Aarhus)
– Österreichische Nationalbibliothek (Austrian National Library)
– Schweizerische Nationalbibliothek (Swiss National Library)
– Virtual Knowledge Studio – Royal Netherlands Academy for Arts and Sciences

• North America
– Bibliothèque et Archives Nationales du Québec (BAnQ)
– California Digital Library (U.S.)
– Centre for Global eHealth Innovation, WebCite® Internet Citations Archiving Project (Canada)
– Internet Archive (U.S.)
– Library and Archives Canada
– Library of Congress (U.S.)
– Library of Virginia (U.S.)
– United States Government Printing Office
– University of North Texas Libraries (U.S.)

• Oceania
– National Library of Australia
– National Library of New Zealand

The vision

• A global infrastructure (standard APIs)
– For cross-archive searching of cited URLs (by URL & date)
– Decentralized storing of archived webmaterial

• Pilot project with WebCite®, Internet Archive, and Library and Archives Canada
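Cross-archive searching by URL and date can be pictured as picking, from the holdings of several archives, the snapshot of a cited URL that lies closest to the citation date. The sketch below is an illustration of that idea only: the `Snapshot` record, the function name, and the sample holdings ("archive-a", "archive-b") are hypothetical, and no real archive API is implied.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    archive: str    # which archive holds the copy (hypothetical names below)
    url: str        # the cited URL
    captured: date  # when the copy was taken


def closest_snapshot(holdings: Iterable[Snapshot], url: str, wanted: date) -> Optional[Snapshot]:
    """Return the snapshot of `url` captured closest to `wanted`, across all archives."""
    candidates = [s for s in holdings if s.url == url]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs((s.captured - wanted).days))


holdings = [
    Snapshot("archive-a", "http://example.org/page", date(2003, 1, 31)),
    Snapshot("archive-b", "http://example.org/page", date(2004, 1, 31)),
    Snapshot("archive-b", "http://example.org/other", date(2004, 2, 1)),
]
best = closest_snapshot(holdings, "http://example.org/page", date(2004, 1, 1))
print(best.archive, best.captured)  # prints "archive-b 2004-01-31"
```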

Summary: What WebCite® contributes

• Links/URLs no longer go 404 (dead)
• WebCite’d content does not change
• Internet material can be deemed citable and “archived”
– Encourages “openness” (authors contribute to blogs, wikis etc., and make their datasets available)
– Takes the submission load off journals: much of the scholarly communication can take place outside of journals
• Provides access/impact statistics for cited authors
• Enables one-click self-archiving
• “Internet Archiving 2.0”: Enables archiving of the “hidden/deep web” (where crawlers cannot go), collaborative assignment of metadata

Call for action

• If you are a citing author: use WebCite next time you cite a non-journal URL

• If you are a blogger or a (potentially cited) author publishing online in any other way, put a “WebCite this!” link on your page

• If you are an editor/publisher: Implement WebCite in your workflow (instructions for authors, copyeditors, XML production department)

• If you are a librarian: Contact us to become a long-term preservation partner

www.medicine20congress.com, Toronto, Sept 4-5th, 2008

Thank you!

Funding: Change Foundation, Canadian Institutes of Health Research, NSERC, European Union, SSHRC

Dr G. Eysenbach, Email: geysenba at uhnres.utoronto.ca or @gmail.com,

My peer-reviewed Journal: http://www.jmir.org

My Blog: http://gunther-eysenbach.blogspot.com

My Conferences: http://www.medicine20congress.com

http://www.ehealthcongresss.org

My Slides: http://www.slideshare.net/eysen

Appendix

Copyright Issues

• WebCite® honors robot exclusion standards and “no-archive” tags

• Copyright holders can request removal of material

• “Fair use” defence (used for non-profit/scholarly purposes, only a part of the site was archived, etc.)

• U.S. court ruled that Google’s caching does not constitute a copyright violation, because of fair use and an implied license (Field vs Google, US District Court, District of Nevada, CV-S-04-0413-RCJ-LRL)

• In the future, WebCite® may also
– Allow copyright holders to specify a fee-per-access royalty fee
– Long-term goal: WebCite® does not physically store anything but instead deposits the material in the respective National Libraries etc., who often have a legal deposit mandate*

Legal deposit: a copy of any work published in COUNTRY must be deposited with the National Library of COUNTRY
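Honoring robot exclusion rules and “no-archive” tags can be sketched with Python's standard library. This is a minimal illustration, not WebCite's implementation: the user-agent string "example-archiver" and the function name are assumptions, and the robots.txt text and page HTML are passed in as strings so the example does not touch the network.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMeta(HTMLParser):
    """Detects <meta name="robots" content="... noarchive ...">."""

    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "noarchive" in (a.get("content") or "").lower():
                self.noarchive = True


def may_archive(url, robots_txt, html, user_agent="example-archiver"):
    """True only if robots.txt allows fetching `url` and the page carries no noarchive tag."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return False
    meta = _RobotsMeta()
    meta.feed(html)
    return not meta.noarchive


robots = "User-agent: *\nDisallow: /private/"
print(may_archive("http://example.org/private/x", robots, "<html></html>"))  # prints False
print(may_archive("http://example.org/public", robots, "<html></html>"))     # prints True
```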

WebCite® is a disruptive technology

• If online articles/material are
– Permanently archived and “citable”
– Findable
– “Rankable” (post-publication peer-review)
– (all of which WebCite® plans to implement)

• … what will be the role of the traditional scholarly journal publication?
– Quality of pre-publication peer-review, editing, copyediting is key
– Value-added services (e.g. semantic markup, curation)

<ref id="ref19">
  <label>19</label>
  <nlm-citation citation-type="web">
    <article-title>Who Gets ALS</article-title>
    <source>ALS Association</source>
    <access-date>2008 Apr 25</access-date>
    <comment><ext-link xlink:type="simple" xlink:href="http://www.alsa.org/als/who.cfm" ext-link-type="uri">http://www.alsa.org/als/who.cfm</ext-link></comment>
    <pub-id pub-id-type="other">5Y0NuDIU9</pub-id>
  </nlm-citation>
</ref>

http://www.alsa.org/als/who.cfm




<ref id="ref19">
  <label>19</label>
  <nlm-citation citation-type="web">
    <article-title>Who Gets ALS</article-title>
    <source>ALS Association</source>
    <access-date>2008 Apr 25</access-date>
    <comment><ext-link xlink:type="simple" xlink:href="http://www.webcitation.org/query?url=http://www.alsa.org/als/who.cfm&amp;date=2008-04-25" ext-link-type="uri">http://www.webcitation.org/query?url=http://www.alsa.org/als/who.cfm&amp;date=2008-04-25</ext-link></comment>
  </nlm-citation>
</ref>

http://www.webcitation.org/query?url=&date=2008-06-18
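The step from the first reference to the second (replacing the cited URL with a WebCite query link built from the URL and the access date) can be sketched in Python. This is an illustration under assumptions, not the production conversion: the function name is hypothetical, the xlink namespace is declared on the sample element so it parses on its own, and the nested URL is percent-encoded as in the transparent citation format shown earlier in the talk.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from urllib.parse import urlencode

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)  # keep the "xlink:" prefix when serializing
HREF = f"{{{XLINK}}}href"

REF = """<ref id="ref19" xmlns:xlink="http://www.w3.org/1999/xlink">
  <nlm-citation citation-type="web">
    <access-date>2008 Apr 25</access-date>
    <comment><ext-link ext-link-type="uri" xlink:href="http://www.alsa.org/als/who.cfm">http://www.alsa.org/als/who.cfm</ext-link></comment>
  </nlm-citation>
</ref>"""


def webcite_reference(ref_xml: str) -> str:
    """Point every ext-link in a reference at a WebCite query URL (URL + access date)."""
    ref = ET.fromstring(ref_xml)
    # "%b" expects English month abbreviations under the default locale.
    accessed = datetime.strptime(ref.findtext(".//access-date"), "%Y %b %d").date()
    for link in ref.iter("ext-link"):
        query = urlencode({"url": link.get(HREF), "date": accessed.isoformat()})
        archived = f"http://www.webcitation.org/query?{query}"
        link.set(HREF, archived)
        link.text = archived
    return ET.tostring(ref, encoding="unicode")


print(webcite_reference(REF))
```

The serializer escapes the "&" in the query string as "&amp;amp;"-style XML entities automatically, which keeps the rewritten reference well-formed.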














Technology

[Diagram: WebCite® technology overview. Labels: Citing Author (/comb), Publisher/Editor (/archive, /comb), WebCite® (self-archiving) and mirrors, DOI® server, link resolver, (optional) DOI assignment, Reader (snapshot retrieval request, DOI with hash), dynamic content, static content]
