Avoiding Zombies in Archival Replay Using ServiceWorker
Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University, Norfolk, VA, 23529
@ibnesayeed @WebSciDL
Supported in part by NSF III 1526700
WADL 2017, June 22-23, 2017, Toronto, Ontario, Canada

Page 2: Avoiding Zombies in Archival Replay Using ServiceWorker

Sawood Alam <@ibnesayeed>

2008 Memento Seen in 2017

● https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html

Page 3: Avoiding Zombies in Archival Replay Using ServiceWorker

2008 Memento Seen in 2012

● http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html

Page 4: Avoiding Zombies in Archival Replay Using ServiceWorker

XenLand @ Alpha Centauri

Page 5: Avoiding Zombies in Archival Replay Using ServiceWorker

Zombies in Archive

Page 6: Avoiding Zombies in Archival Replay Using ServiceWorker

Zombies in Archive

<img src="http://xenland.alpha/images/map.png">
// Is rewritten on replay to become:
<img src="http://archive.example.org/1998/http://xenland.alpha/images/map.png">

// URLs constructed by JavaScript are harder to rewrite on replay, e.g.:
var base = 'http://xenland.alpha';
var imgdir = '/images/';
var img = document.createElement('img');
img.src = base + imgdir + 'ruler.png';
document.getElementById('ruler').appendChild(img);
//=>> http://xenland.alpha/images/ruler.png
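
The gap between the two cases can be sketched with a deliberately naive rewriter. This is an illustration only: the function name rewriteHtml and its regular expression are hypothetical and are not taken from any archive's replay system. It prefixes absolute URLs found in src/href attributes and leaves script source alone, which is why the JavaScript-built URL escapes rewriting.

```javascript
// Hypothetical archive prefix, matching the example URL on this slide.
const ARCHIVE_PREFIX = 'http://archive.example.org/1998/';

// Naive rewriter (an assumption for illustration): prefixes absolute URLs
// found in double-quoted src/href attributes and touches nothing else.
function rewriteHtml(html) {
  return html.replace(
    /\b(src|href)="(https?:\/\/[^"]+)"/g,
    (match, attr, url) => `${attr}="${ARCHIVE_PREFIX}${url}"`
  );
}

const staticImg = '<img src="http://xenland.alpha/images/map.png">';
const script = "var base = 'http://xenland.alpha'; img.src = base + '/images/ruler.png';";

console.log(rewriteHtml(staticImg));         // the attribute gets the archive prefix
console.log(rewriteHtml(script) === script); // true: nothing inside the script matched
```

When the unmodified script runs in the browser, it requests ruler.png from the live site rather than from the archive.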

Page 7: Avoiding Zombies in Archival Replay Using ServiceWorker

Replay URL Resolution & Rewriting

Reference type    Example                                      Resolution after relocation
Relative path     images/logo.png                              Potentially correct
Absolute path     /public/images/logo.png                      Potentially incorrect
Absolute URL      http://example.com/public/images/logo.png    Potentially live leakage

http://example.com/public/index.html

...<img src="/public/images/logo.png">...

http://archive.example.org/<datetime>/http://example.com/public/index.html

...<img src="/<datetime>/http://example.com/public/images/logo.png">...
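
The three outcomes in the table can be checked with the standard URL constructor, which resolves a reference against a base the way a browser does. This is a small sketch: the 14-digit datetime in the replay URL is made up for illustration, and resolveAll is a hypothetical helper name.

```javascript
// Hypothetical replay URL of the example page; the datetime is invented.
const replayBase =
  'http://archive.example.org/20170622000000/http://example.com/public/index.html';

// The three reference types from the table, left unrewritten.
const refs = {
  relativePath: 'images/logo.png',
  absolutePath: '/public/images/logo.png',
  absoluteUrl: 'http://example.com/public/images/logo.png',
};

// Resolve each reference against the replay URL, as the browser would.
function resolveAll(references, base) {
  const resolved = {};
  for (const [kind, ref] of Object.entries(references)) {
    resolved[kind] = new URL(ref, base).href;
  }
  return resolved;
}

const resolved = resolveAll(refs, replayBase);
console.log(resolved.relativePath); // stays under the archived path
console.log(resolved.absolutePath); // archive host, but datetime and original host are lost
console.log(resolved.absoluteUrl);  // points at the live site
```

Only the relative path lands on the right memento without rewriting. The absolute path resolves against the archive's own root, and the absolute URL bypasses the archive entirely.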

Page 8: Avoiding Zombies in Archival Replay Using ServiceWorker

Avoiding Zombies

● Ahead-of-time rendering and JS execution
  ○ http://archive.is/
● Archival replay proxy
  ○ https://github.com/ikreymer/pywb/wiki/Pywb-Proxy-Mode-Usage
● Browser extension
  ○ MementoFox (deprecated)
● JS override
  ○ wombat.js in PyWB
● ServiceWorker

Page 9: Avoiding Zombies in Archival Replay Using ServiceWorker

ServiceWorker

● New web API (still a working draft)
● A standalone JavaScript file
● Persists in the browser independent of the window
● Acts as a proxy
● Installed by a web page under its domain at a specific path (called scope)
● Intercepts all requests in scope
  ○ Resources under the scope path (at any depth)
  ○ Secondary resource requests originated from any resource under scope
● Allows modification of requests and responses
● Primarily used in web applications for offline access and notification support
● Requires HTTPS
● Growing browser support (73.61% as of June 8, 2017)
  ○ http://caniuse.com/#feat=serviceworkers
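
As a rough sketch of the installation and scope points above: a page registers a worker script for a scope, and a URL falls under that scope when it begins with the scope URL. The script path /serviceworker.js, the scope value, and the helper isInScope are assumptions chosen for illustration. The registration part only runs in a browser over HTTPS (or on localhost).

```javascript
// Prefix match: a URL is in scope when it starts with the scope URL.
// isInScope is a hypothetical helper, not part of the ServiceWorker API.
function isInScope(clientUrl, scopeUrl) {
  return new URL(clientUrl).href.startsWith(new URL(scopeUrl).href);
}

// Browser-only: register a worker script (hypothetical path) for a scope.
if (typeof navigator !== 'undefined' && navigator.serviceWorker) {
  navigator.serviceWorker
    .register('/serviceworker.js', { scope: '/' })
    .then((reg) => console.log('Registered with scope', reg.scope))
    .catch((err) => console.error('Registration failed', err));
}

console.log(isInScope(
  'https://archive.example.org/replay/2017/page.html',
  'https://archive.example.org/replay/'
));
```

Once a page in scope is controlled by the worker, the requests that page issues for its embedded resources pass through the worker's fetch handler as well.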

Page 10: Avoiding Zombies in Archival Replay Using ServiceWorker

reconstructive.js

● A ServiceWorker script written for archival replay
● Plug-in for web archives or Memento aggregators
● Intercepts all network requests originated from a memento
● Reroutes requests to an archive (prevents live leakage & incorrect references)
● Optionally rewrites the content to add banner & to fix hyperlinks
● https://github.com/oduwsdl/reconstructive
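
The rerouting idea can be sketched as follows. This is not the code of reconstructive.js: the archive prefix, the URL layout (prefix, 14-digit datetime, original URL), and the reroute function are all assumptions made for illustration. The worker reads the memento's datetime from the referrer and sends each stray request back to the archive.

```javascript
// Assumed replay layout: <prefix><14-digit datetime>/<original URL>
const ARCHIVE_PREFIX = 'https://archive.example.org/';
const URIM_PATTERN = /^https:\/\/archive\.example\.org\/(\d{14})\/(.+)$/;

// Map a request issued by a memento to a URL inside the archive.
function reroute(requestUrl, referrerUrl) {
  if (URIM_PATTERN.test(requestUrl)) return requestUrl; // already a memento URL
  const ref = URIM_PATTERN.exec(referrerUrl);
  if (!ref) return requestUrl; // not issued by a memento; leave untouched
  const [, datetime, originalPage] = ref;
  let target = requestUrl;
  if (requestUrl.startsWith(ARCHIVE_PREFIX)) {
    // An absolute path resolved against the archive host: re-resolve it
    // against the original page's URL instead.
    const u = new URL(requestUrl);
    target = new URL(u.pathname + u.search, originalPage).href;
  }
  return `${ARCHIVE_PREFIX}${datetime}/${target}`;
}

// Only runs inside a ServiceWorker; simplified to GET-style fetches.
if (typeof ServiceWorkerGlobalScope !== 'undefined' &&
    self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener('fetch', (event) => {
    const url = reroute(event.request.url, event.request.referrer);
    event.respondWith(fetch(url));
  });
}
```

Under these assumptions, both a leaked request for http://example.com/public/images/logo.png and a misresolved https://archive.example.org/public/images/logo.png are rerouted to the same memento URL, and the archived HTML itself is never modified.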

Page 11: Avoiding Zombies in Archival Replay Using ServiceWorker

Zombies, No More!

● https://github.com/oduwsdl/ipwb

Page 12: Avoiding Zombies in Archival Replay Using ServiceWorker

Rewriting Mementos is Expensive

Original capture (without any rewriting)

In our experiment on 500 home pages, we observed:

● One-fifth mean data overhead
● One-third mean time overhead

15% more data in twice the time

Page 13: Avoiding Zombies in Archival Replay Using ServiceWorker

Archival Capture Replay Test Suite (ACRTS)

reconstructive.js

● https://ibnesayeed.github.io/acrts/

Page 14: Avoiding Zombies in Archival Replay Using ServiceWorker

Reconstruction Winners: PyWB & reconstructive.js

A. OpenWayback
B. PyWB
C. Memento Reconstruct
D. Memento for Chrome
E. reconstructive.js

Page 15: Avoiding Zombies in Archival Replay Using ServiceWorker

Future Work

● Use “Prefer” header for original content (when archives support it)
● Add a customizable archival banner
● Add click handler for lazy rewriting of hyperlinks
● Handle archived ServiceWorkers
● Write a 404-combat ServiceWorker script for webmasters
● http://ws-dl.blogspot.co.uk/2016/08/2016-08-15-mementos-in-raw-take-two.html

Page 16: Avoiding Zombies in Archival Replay Using ServiceWorker

Conclusions

● reconstructive.js => no zombies!
● Rerouting instead of rewriting (lazy rewriting)
● Mean overhead reduction
  ○ one-fifth data
  ○ one-third time
● 73.61% (and growing) browser support for ServiceWorker
  ○ http://caniuse.com/#feat=serviceworkers
● reconstructive.js
  ○ https://github.com/oduwsdl/reconstructive
● Archival Capture Replay Test Suite
  ○ https://ibnesayeed.github.io/acrts/