Upload
sawood-alam
View
959
Download
3
Embed Size (px)
Citation preview
Avoiding Zombies in Archival Replay Using
ServiceWorker
Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. NelsonWeb Science and Digital Libraries Research Group
Old Dominion University, Norfolk, VA, 23529
@ibnesayeed@WebSciDL
Supported in part by NSF III 15267001
WADL 2017, June 22-23, 2017, Toronto, Ontario, Canada
Sawood Alam <@ibnesayeed>
2008 Memento Seen in 2017
2
● https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
?
Sawood Alam <@ibnesayeed>
2008 Memento Seen in 2012
3
● http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Sawood Alam <@ibnesayeed>
XenLand @ Alpha Centauri
4
Sawood Alam <@ibnesayeed>
Zombies in Archive
5
?
Sawood Alam <@ibnesayeed>
Zombies in Archive
6
<img src="http://xenland.alpha/images/map.png">// Is rewritten on replay to become:<img src="http://archive.example.org/1998/http://xenland.alpha/images/map.png">
// URLs constructed by JavaScript are harder to rewrite on replay, e.g.:var base = 'http://xenland.alpha';var imgdir = '/images/';var img = document.createElement('img');img.src = base + imgdir + 'ruler.png';document.getElementById('ruler').appendChild(img);//=>> http://xenland.alpha/images/ruler.png
Sawood Alam <@ibnesayeed>
Replay URL Resolution & Rewriting
7
Reference type Example Resolution after relocation
Relative path images/logo.png Potentially correct
Absolute path /public/images/logo.png Potentially incorrect
Absolute URL http://example.com/public/images/logo.png Potentially live leakage
http://example.com/public/index.html
...<img src="/public/images/logo.png">...
http://archive.example.org/<datetime>/http://example.com/public/index.html
...<img src="/<datetime>/http://example.com/public/images/logo.png">...
Sawood Alam <@ibnesayeed>
Avoiding Zombies
● Ahead-of-time rendering and JS execution○ http://archive.is/
● Archival replay proxy○ https://github.com/ikreymer/pywb/wiki/Pywb-Proxy-Mode-Usage
● Browser extension○ MementoFox (deprecated)
● JS override○ wombat.js in PyWB
● ServiceWorker
8
Sawood Alam <@ibnesayeed>
● New web API (still a working draft)● A standalone JavaScript file● Persists in the browser independent of the window● Acts as a proxy● Installed by a web page under its domain at a specific path (called scope)● Intercepts all requests in scope
○ Resources under the scope path (at any depth)○ Secondary resource requests originated from any resource under scope
● Allows modification in request and response● Primarily used in web applications for offline access and notification support● Requires HTTPS● Growing browser support (73.61% as of June 8, 2017)
ServiceWorker
9● http://caniuse.com/#feat=serviceworkers
Sawood Alam <@ibnesayeed>
reconstructive.js
10● https://github.com/oduwsdl/reconstructive
● A ServiceWorker script written for archival replay● Plug-in for web archives or Memento aggregators● Intercepts all network requests originated from a memento● Reroutes requests to an archive (prevents live leakage & incorrect references)● Optionally rewrites the content to add banner & to fix hyperlinks
Sawood Alam <@ibnesayeed>
Zombies, No More!
11● https://github.com/oduwsdl/ipwb
Sawood Alam <@ibnesayeed>
Rewriting Mementos is Expensive
12
Original capture (without any rewriting)
In our experiment over 500 home pages we observed:
● One-fifth mean data overhead● One-third mean time overhead
15% more data in twice the time
Sawood Alam <@ibnesayeed>
Archival Capture Replay Test Suite (ACRTS)
13
reconstructive.js
● https://ibnesayeed.github.io/acrts/
Sawood Alam <@ibnesayeed>
Reconstruction Winners: PyWB & reconstructive.js
A. OpenWaybackB. PyWBC. Memento
ReconstructD. Memento for
ChromeE. reconstructive.js
14
Sawood Alam <@ibnesayeed>
Future Work
● Use “Prefer” header for original content (when archives support it)● Add a customizable archival banner● Add click handler for lazy rewriting of hyperlinks● Handle archived ServiceWorkers● Write a 404-combat ServiceWorker script for webmasters
15
● http://ws-dl.blogspot.co.uk/2016/08/2016-08-15-mementos-in-raw-take-two.html
Sawood Alam <@ibnesayeed>
● reconstructive.js => no zombies!● Rerouting instead of rewriting (lazy rewriting)● Mean overhead reduction
○ one-fifth data○ one-third time
● 73.61% (and growing) browser support for ServiceWorker○ http://caniuse.com/#feat=serviceworkers
● reconstructive.js○ https://github.com/oduwsdl/reconstructive
● Archival Capture Replay Test Suite○ https://ibnesayeed.github.io/acrts/
Conclusions
16