Herbert Van de Sompel 404/File Not Found, Washington, DC, October 24 2014 Herbert Van de Sompel @hvdsomp http://public.lanl.gov/herbertv/ Los Alamos National Laboratory Acknowledgements: Michael L. Nelson @phonedude_mln Old Dominion University Creating Pockets of Persistence

Creating Pockets of Persistence

DESCRIPTION

Extended version of slides presented at the "404/File Not Found" symposium held at Georgetown University on October 24 2014, see http://www.law.georgetown.edu/library/404/. The presentation provides a brief overview of the link/reference rot problem and then discusses three complementary strategies to combat it: pro-actively capturing web resources that are linked from a seed collection; referencing the captures by means of annotated links; and accessing the captures using Memento infrastructure.

Citation preview

Page 1: Creating Pockets of Persistence

Herbert Van de Sompel, 404/File Not Found, Washington, DC, October 24 2014

Herbert Van de Sompel, @hvdsomp

http://public.lanl.gov/herbertv/

Los Alamos National Laboratory

Acknowledgements: Michael L. Nelson, @phonedude_mln, Old Dominion University

Creating Pockets of Persistence

Page 2: Creating Pockets of Persistence

Addressing the Link/Reference Rot Challenge

• Pockets of Persistence

• Capture – Archive Pro-Actively, Selectively

• Reference – Annotate Links

• Access – Travel in Time

Page 3: Creating Pockets of Persistence

Pockets of Persistence

How to achieve the ability to:

• Persistently
• Precisely
• Seamlessly

revisit the Web of the Past and the Web of the Now at some point in the Future

Page 4: Creating Pockets of Persistence

Pockets of Persistence

How to achieve the ability to:

• Persistently
• Precisely
• Seamlessly

revisit the Web of the Past and the Web of the Now at some point in the Future

Two components to the link/reference rot challenge:

• Link rot: Links stop working aka 404 Not Found

• Content drift: Referenced content changes over time

Page 5: Creating Pockets of Persistence

Illustration

Current version of http://en.wikipedia.org/wiki/Coil_(band) on October 22 2014

Page 6: Creating Pockets of Persistence

Illustration – Link Rot

Current version of http://en.wikipedia.org/wiki/Coil_(band) on October 22 2014

Page 7: Creating Pockets of Persistence

Illustration – Link Rot

Current version of http://liarsociety.tripod.com/blog/index.blog?from=20041130 on October 22 2014

Page 8: Creating Pockets of Persistence

Illustration – Content Drift

Version of http://en.wikipedia.org/wiki/Coil_(band) dated October 2 2010
http://en.wikipedia.org/w/index.php?title=Coil_(band)&oldid=388321480

Page 9: Creating Pockets of Persistence

Illustration – Content Drift

Current version of http://en.wikipedia.org/wiki/Peter_Christopherson on October 22 2014

Page 10: Creating Pockets of Persistence

Illustration – Content Drift

Version of http://en.wikipedia.org/wiki/Peter_Christopherson that was current on October 2 2010
http://en.wikipedia.org/w/index.php?title=Peter_Christopherson&oldid=387987414

Page 11: Creating Pockets of Persistence

Pockets of Persistence

How to achieve the ability to:

• Persistently
• Precisely
• Seamlessly

revisit the Web of the Past and the Web of the Now at some point in the Future

This challenge exists for the entire web, but some communities actually care about addressing it:

• scholarly communication
• legal publications
• journalism
• Wikipedia
• …

Mobilize the communities that care about this problem to work towards joint, interoperable solutions and approaches

Page 12: Creating Pockets of Persistence

Addressing the Link/Reference Rot Challenge

• Pockets of Persistence

• Capture – Archive Pro-Actively, Selectively

• Reference – Annotate Links

• Access – Travel in Time

Page 13: Creating Pockets of Persistence

Pro-Active Capture for a Seed Collection

• Seed Collection – Starting point for capture is a seed collection of interest to communities that care, e.g.
  o Scholarly literature
  o Legal documents
  o On-Line journalism
  o Wikipedia articles

• Lifecycle Events – Intervene at critical moments in the lifecycle of items in these collections to pro-actively capture
  o Collection items – some solutions in place
  o Web resources referenced in collection items

Page 14: Creating Pockets of Persistence

Pro-Active Capture for Seed Collection

• What those crucial lifecycle events are may depend on the collection type

Wikipedia

• Creation of new article
• Creation of new version of article
• Creation of substantially new version of article
• Addition of external reference to article
• References to article exceed a certain threshold

Scholarly Literature

Page 15: Creating Pockets of Persistence

Authoring Legal Documents – perma.cc

http://perma.cc

Page 16: Creating Pockets of Persistence

Authoring Scholarly Literature: Experimental Zotero Extension

Richard Wincewicz (2014) Prototype Hiberlink plugin for Zotero for pro-active archiving and temporal references
https://www.youtube.com/v/ZYmi_Ydr65M%26vq

Page 17: Creating Pockets of Persistence

Submitting Scholarly Literature: Experimental HiberActive Service

Martin Klein et al. (2014) HiberActive: Pro-Active Archiving of web references from scholarly articles
Open Repositories 2014
http://www.slideshare.net/martinklein0815/hiberactive

Page 18: Creating Pockets of Persistence

Pro-Active Capture for Seed Collection

• Interoperability for on-demand capture:
  o Need basic interoperability for machine-driven on-demand capture:
    - Discovery of capture interface
    - Interface IN - [ Original URI ]
    - Interface OUT - [ URI of Capture ; Capture Datetime ]
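
The IN/OUT contract on this slide can be pictured as a small interface. What follows is a minimal sketch under stated assumptions, not an existing API: `CaptureResult`, `CaptureService`, and `InMemoryCaptureService` are hypothetical names, and the `https://archive.example/` capture URIs are made up for illustration. Discovery of the capture interface is not covered.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Protocol


@dataclass(frozen=True)
class CaptureResult:
    """Interface OUT: [ URI of Capture ; Capture Datetime ]."""
    capture_uri: str
    capture_datetime: datetime


class CaptureService(Protocol):
    """Hypothetical on-demand capture interface (not a real archive API)."""

    def capture(self, original_uri: str) -> CaptureResult:
        """Interface IN: [ Original URI ]."""
        ...


class InMemoryCaptureService:
    """Toy stand-in for an archive: it records requests but fetches nothing."""

    def __init__(self, base: str = "https://archive.example/") -> None:
        self.base = base
        self.captures: List[CaptureResult] = []

    def capture(self, original_uri: str) -> CaptureResult:
        now = datetime.now(timezone.utc)
        # Made-up capture URI scheme: base URI plus a sequence number.
        result = CaptureResult(f"{self.base}{len(self.captures) + 1}", now)
        self.captures.append(result)
        return result


service: CaptureService = InMemoryCaptureService()
result = service.capture("http://example.org/page")
print(result.capture_uri, result.capture_datetime.isoformat())
```

Returning the capture datetime together with the capture URI is what lets an authoring tool record all three pieces of information needed later for link annotation.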

Page 19: Creating Pockets of Persistence

Addressing the Link/Reference Rot Challenge

• Pockets of Persistence

• Capture – Archive Pro-Actively, Selectively

• Reference – Annotate Links

• Access – Travel in Time

Page 20: Creating Pockets of Persistence

Reference Captures and Annotate Links

• Existing practice for linking to captures:
  o Link to URI of Capture
  o Lose Original URI
  o Lose Capture Datetime

• Problems with existing practice:
  o Impossible to visit the original URI, if desired
  o Requires the permanent existence/uptime of the archive that holds the capture
    - One link rot problem replaced by another

Van de Sompel, H. et al. (2013) Thoughts on referencing, linking, reference rot
http://mementoweb.org/missing-link/

Page 21: Creating Pockets of Persistence

Permanent Existence/Uptime of Archives?

Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp

Page 22: Creating Pockets of Persistence

Permanent Existence/Uptime of Archives?

http://webcitation.org/ on August 6 2014

Page 23: Creating Pockets of Persistence

Permanent Existence/Uptime of Archives?

Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/

Page 24: Creating Pockets of Persistence

Permanent Existence/Uptime of Archives?

http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html

Page 25: Creating Pockets of Persistence

Hacking Original URI, Capture Datetime from Capture URI?

URI of Capture | Original URI | Datetime T
https://web.archive.org/web/20140214233752/https://www.mummify.it | yes | yes
https://archive.today/eAETp | no | no
http://perma.cc/4RH7-999Q?type=source | no | no
http://en.wikipedia.org/w/index.php?title=Coil_(band)&oldid=388321480 | no | no
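
Of the four capture URIs on this slide, only the Wayback-style one exposes both the Original URI and the capture datetime in its path. A minimal sketch of that "hack" follows. It assumes the `/web/<14-digit timestamp>/<original URI>` layout seen in the example and treats the timestamp as UTC; `split_capture_uri` is an illustrative name, not part of any archive's API.

```python
import re
from datetime import datetime, timezone
from typing import Optional, Tuple

# Assumed Wayback-style layout:
#   https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URI>
WAYBACK_STYLE = re.compile(r"^https?://[^/]+/web/(\d{14})/(.+)$")


def split_capture_uri(capture_uri: str) -> Optional[Tuple[str, datetime]]:
    """Return (original URI, capture datetime), or None if not derivable."""
    match = WAYBACK_STYLE.match(capture_uri)
    if match is None:
        return None
    stamp, original_uri = match.groups()
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return original_uri, captured


for uri in (
    "https://web.archive.org/web/20140214233752/https://www.mummify.it",
    "https://archive.today/eAETp",
    "http://perma.cc/4RH7-999Q?type=source",
    "http://en.wikipedia.org/w/index.php?title=Coil_(band)&oldid=388321480",
):
    print(uri, "->", split_capture_uri(uri))
```

The three opaque URIs yield `None`, which is the point of the slide: without an annotation, the Original URI and datetime are lost for most archives.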

Page 26: Creating Pockets of Persistence

Using Capture URI to find Captures in Other Web Archives?

Page 27: Creating Pockets of Persistence

Using Capture URI to find Captures in Other Web Archives?

Page 28: Creating Pockets of Persistence

Reference Captures and Annotate Links

• Desired practice for linking to captures is to annotate the link so it conveys:
  - URI of Capture
  - Original URI
  - Capture Datetime

• Link annotation supports fallback to other archives:
  o Original URI allows finding captures in all web archives
  o Capture Datetime allows finding an appropriate capture in all web archives
  o Original URI and Capture Datetime allow automatic access to an appropriate capture in all web archives (see Access)

Van de Sompel, H. et al. (2013) Thoughts on referencing, linking, reference rot
http://mementoweb.org/missing-link/

Page 29: Creating Pockets of Persistence

Reference Captures and Annotate Links

• Desired practice for linking to captures is to annotate the link so it conveys:

  - URI of Capture
  - Original URI
  - Capture Datetime

Page 30: Creating Pockets of Persistence

Reference Captures and Annotate Links

• Interoperability for link annotation:
  o Need an approach to convey, in a uniform, machine-actionable way:
    - URI of Capture
    - Original URI
    - Capture Datetime

• Ongoing efforts:
  o Missing Link Proposal
    - http://mementoweb.org/missing-link/
  o W3C Robustness and Archiving Community Group
    - http://www.w3.org/community/irobar/

Page 31: Creating Pockets of Persistence

Missing Link Proposal

<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130" data-versionurl="https://archive.today/ElCHn" data-versiondate="2008-02-06T00:00:00Z">

• href – Original URI
• data-versionurl – URI of Capture
• data-versiondate – Capture Datetime

Van de Sompel, H. et al. (2013) Thoughts on referencing, linking, reference rot
http://mementoweb.org/missing-link/
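
To show that such an annotated link is machine-actionable, here is a minimal sketch that reads the three values back out of the anchor using Python's standard `html.parser`. The class name `AnnotatedLinkParser` and the link text "a blog post" are assumptions made for illustration; only the attribute names and values come from the slide.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class AnnotatedLinkParser(HTMLParser):
    """Collect <a> elements that carry data-versionurl / data-versiondate."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "data-versionurl" in attributes or "data-versiondate" in attributes:
            self.links.append({
                "original_uri": attributes.get("href"),
                "capture_uri": attributes.get("data-versionurl"),
                "capture_datetime": attributes.get("data-versiondate"),
            })


# Link text "a blog post" is made up; the attributes are those on the slide.
html = (
    '<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130" '
    'data-versionurl="https://archive.today/ElCHn" '
    'data-versiondate="2008-02-06T00:00:00Z">a blog post</a>'
)
parser = AnnotatedLinkParser()
parser.feed(html)
print(parser.links)
```

Because `href` still points at the Original URI, browsers that know nothing about the annotation keep working as before.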

Page 32: Creating Pockets of Persistence

Addressing the Link/Reference Rot Challenge

• Pockets of Persistence

• Capture – Archive Pro-Actively, Selectively

• Reference – Annotate Links

• Access – Travel in Time

Page 33: Creating Pockets of Persistence

Memento Web Time Travel

Use the Original URI

Current version of http://law.georgetown.edu/library/404/ on October 22 2014

Page 34: Creating Pockets of Persistence

Memento Web Time Travel

And a Datetime

Page 35: Creating Pockets of Persistence

Memento Web Time Travel

To automatically retrieve the temporally nearest available capture

Capture of http://law.georgetown.edu/library/404/ dated May 3 2014
http://wayback.archive-it.org/all/20140503094327/http://www.law.georgetown.edu/library/404/
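
In protocol terms, the Original URI plus a datetime become a request to a Memento TimeGate with an `Accept-Datetime` header. The sketch below only builds such a request and does not send it. The TimeGate base URI of the public Time Travel aggregator is an assumption here and may differ from the service used in the demonstration; `build_timegate_request` is an illustrative name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: base URI of a Memento aggregator TimeGate.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_uri: str, when: datetime) -> Request:
    """Build (but do not send) a Memento datetime-negotiation request."""
    # Accept-Datetime carries an RFC 1123 date expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        TIMEGATE_BASE + original_uri,
        headers={"Accept-Datetime": accept_datetime},
        method="HEAD",
    )


request = build_timegate_request(
    "http://law.georgetown.edu/library/404/",
    datetime(2014, 5, 3, 9, 43, 27, tzinfo=timezone.utc),
)
print(request.full_url)
# urllib stores header names capitalized, hence "Accept-datetime" here;
# HTTP header names are case-insensitive on the wire.
print(request.get_header("Accept-datetime"))
```

A TimeGate that supports the protocol answers with a redirect to the capture nearest to the requested datetime, which is how the May 3 2014 capture above would be reached.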

Page 36: Creating Pockets of Persistence

Memento Web Time Travel

http://mementoweb.org

http://bit.ly/memento-for-chrome

Page 37: Creating Pockets of Persistence

Travel in Time - Persistently, Precisely, Seamlessly

On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Accessible | + | - | -

• Time Travel is:

• Persistent – See next slide

• Precise – Following link to URI of Capture retrieves exact capture

• Seamless – Requires clicking a link as usual

Page 38: Creating Pockets of Persistence

Travel in Time - Persistently, Precisely, Seamlessly

On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Not Accessible | + | - | -

• Time Travel is:

• Persistent – Following link to URI of Capture leads nowhere

• Precise – Following link to URI of Capture leads nowhere

• Seamless – Following link to URI of Capture leads nowhere

Page 39: Creating Pockets of Persistence

Travel in Time - Persistently, Precisely, Seamlessly

On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Not Accessible | + | + | +

• Time Travel is:

• Persistent – Using Memento with [ Original URI ; Datetime ] works across web archives, versioning systems

• Precise – Using Memento with [ Original URI ; Datetime ] retrieves nearest capture from other archive

• Seamless – Requires browser plugin

Page 40: Creating Pockets of Persistence

Travel in Time - Persistently, Precisely, Seamlessly

On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Accessible | - | + | +

• Time Travel is:

• Persistent – Using Memento with [ Original URI ; Datetime ] works across web archives, versioning systems

• Precise – Using Memento with [ Original URI ; Datetime ] retrieves exact capture from other archive

• Seamless – Requires browser plugin

Page 41: Creating Pockets of Persistence

Travel in Time - Persistently, Precisely, Seamlessly

On-Demand Capture | URI of Capture | Original URI | Datetime T
Not Available | - | + | +

• Time Travel is:

• Persistent – Using Memento with [ Original URI ; Datetime ] works across web archives, versioning systems

• Precise – Using Memento with [ Original URI ; Datetime ] retrieves nearest capture from other archive

• Seamless – Requires browser plugin
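
The cases on the preceding slides amount to a fallback rule: follow the URI of Capture when it still resolves, otherwise use the Original URI and Datetime to pick the temporally nearest capture held elsewhere. A minimal sketch of that rule follows. `AnnotatedLink`, `nearest_capture`, and `resolve` are hypothetical names, the accessibility check and archive lookup are passed in as stub callables, and the `.example` URIs are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Optional, Tuple

Capture = Tuple[str, datetime]  # (URI of Capture, Capture Datetime)


@dataclass(frozen=True)
class AnnotatedLink:
    capture_uri: Optional[str]
    original_uri: Optional[str]
    capture_datetime: Optional[datetime]


def nearest_capture(captures: Iterable[Capture], target: datetime) -> Optional[str]:
    """Pick the capture URI whose datetime is closest to the target."""
    best = min(captures, key=lambda c: abs(c[1] - target), default=None)
    return best[0] if best else None


def resolve(
    link: AnnotatedLink,
    is_accessible: Callable[[str], bool],
    find_captures: Callable[[str], List[Capture]],
) -> Optional[str]:
    # Case 1: the referenced capture still resolves; use it directly.
    if link.capture_uri and is_accessible(link.capture_uri):
        return link.capture_uri
    # Case 2: fall back to captures of the Original URI held elsewhere.
    if link.original_uri and link.capture_datetime:
        return nearest_capture(find_captures(link.original_uri), link.capture_datetime)
    # Case 3: bare capture link whose archive is gone; nothing left to try.
    return None


UTC = timezone.utc
other_archives = [
    ("https://archive-a.example/2013", datetime(2013, 1, 1, tzinfo=UTC)),
    ("https://archive-b.example/2014", datetime(2014, 3, 1, tzinfo=UTC)),
]
link = AnnotatedLink(
    capture_uri="https://gone.example/abc",
    original_uri="http://example.org/page",
    capture_datetime=datetime(2014, 2, 14, tzinfo=UTC),
)
print(resolve(link, lambda uri: False, lambda uri: other_archives))
```

In a real deployment the lookup in case 2 would be delegated to Memento infrastructure rather than done client-side; the stub only illustrates why both the Original URI and the datetime are needed.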

Page 42: Creating Pockets of Persistence

Reference Captures and Annotate Links

• Interoperability for time travel:
  o Memento protocol specifies interoperability across web archives, version management systems
  o Memento protocol is supported by major web archives
  o Need to work towards Memento support by version management systems
  o Need to work towards making Memento experience seamless through native browser support
  o Need to work towards robustness and sustainability of Memento infrastructure

Page 43: Creating Pockets of Persistence

Conclusion

• Significant technical solutions, infrastructure, ideas exist to address the link rot/reference rot challenge

• Mobilize the communities that care about this challenge to work towards joint, interoperable approaches

Page 44: Creating Pockets of Persistence

Creating Pockets of Persistence

http://mementoweb.org

http://hiberlink.org