WADL 2013 July 25-26th Indianapolis, IN
Martin Klein @mart1nkle1n
SiteStory Archiving Done Differently
http://mementoweb.github.io/SiteStory/
Justin F. [email protected]
LANL SiteStory Team
lead developer
Archiving - the traditional way
• Actively crawl the web
• For example, using Heritrix
• Issues with crawler-based archiving:
  • Request can be rejected (robots.txt, user-agent, IP)
  • Can be deceived (geo-location, user-agent)
  • Can be trapped (crawl my calendar!)
  • Requires constant and massive bandwidth
  • Implied timing problem, when to crawl?
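The first issue, rejection via robots.txt and user-agent, can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the `archive-crawler` agent name, and the example.org URLs are made up for illustration and do not come from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption, not from any real site):
# one crawler is shut out entirely, everyone else is kept out of the calendar.
ROBOTS_TXT = """\
User-agent: archive-crawler
Disallow: /

User-agent: *
Disallow: /calendar/
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the rules in ROBOTS_TXT allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("archive-crawler", "Mozilla/5.0"):
        for path in ("/index.html", "/calendar/2013/07/"):
            url = "http://example.org" + path
            print(agent, path, can_crawl(agent, url))
```

A polite crawler that honors these rules never sees the site at all, while a browser sees everything outside the calendar.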
Archiving - the traditional way
Timing problem:
• Update 1 viewed but not archived
[Timeline figure: t1 R created, t2 browser visit 1, t3 crawler visit 1, t4 R update 1, t5 browser visit 2, t6 R update 2]
Archiving - the SiteStory way
• Transactional Web archiving
• Archive accepts HTTP transaction between browser and server
Timing problem:
• Update 1 viewed and archived
[Timeline figure: t1 R created, t2 browser visit 1, t3 crawler visit 1, t4 R update 1, t5 browser visit 2, t6 R update 2]
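The timeline can be replayed in a few lines of Python to compare what each approach captures. This is a toy model, assuming that an archive stores the current version of resource R whenever its trigger event occurs (a crawler visit for crawling, a browser visit for transactional archiving). The `EVENTS` list and the `archived_versions` helper are illustrative names, not part of SiteStory.

```python
# Timeline from the slide: (time, kind, detail).
EVENTS = [
    (1, "change", "created"),
    (2, "browser", "visit 1"),
    (3, "crawler", "visit 1"),
    (4, "change", "update 1"),
    (5, "browser", "visit 2"),
    (6, "change", "update 2"),
]


def archived_versions(events, trigger):
    """Return the versions of R captured when `trigger` events cause archiving."""
    current = None   # version of R currently on the server
    captured = []    # versions that made it into the archive
    for _, kind, detail in sorted(events):
        if kind == "change":
            current = detail
        elif kind == trigger and current is not None and current not in captured:
            captured.append(current)
    return captured


crawler_archive = archived_versions(EVENTS, "crawler")
transactional_archive = archived_versions(EVENTS, "browser")

if __name__ == "__main__":
    print("crawler:      ", crawler_archive)
    print("transactional:", transactional_archive)
```

In this model neither archive holds update 2, because nothing requested R after t6. A transactional archive records what was actually served, not every state the resource ever had.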
Archiving - the SiteStory way
• Challenges with transactional archiving:
  • To be archived server has to cooperate
  • Transfer data to archive, batch mode or real-time
  • Archive must trust transmission to be authentic
  • Resources from external servers have to be archived out-of-band
• Deduplication challenges
  • Alias: different URI, same response
  • Conneg: same URI, different response
  • Determine “significant” content change
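The alias and conneg cases can be made concrete by grouping observed responses by body digest and by URI. This is a sketch under stated assumptions: the example.org URIs and bodies are invented, SHA-256 is an arbitrary choice of hash, and the helper names are hypothetical.

```python
import hashlib
from collections import defaultdict


def body_digest(body: bytes) -> str:
    """Digest used to compare response bodies (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


# Hypothetical observations: (URI, response body).
OBSERVATIONS = [
    ("http://example.org/", b"<html>home</html>"),
    ("http://example.org/index.html", b"<html>home</html>"),  # alias of "/"
    ("http://example.org/report", b"<html>report</html>"),    # HTML variant
    ("http://example.org/report", b'{"title": "report"}'),    # JSON variant
]


def find_aliases(observations):
    """Map digest -> URIs, keeping only bodies served under several URIs."""
    uris_by_digest = defaultdict(set)
    for uri, body in observations:
        uris_by_digest[body_digest(body)].add(uri)
    return {d: uris for d, uris in uris_by_digest.items() if len(uris) > 1}


def find_multi_response_uris(observations):
    """Map URI -> digests, keeping only URIs that returned several bodies."""
    digests_by_uri = defaultdict(set)
    for uri, body in observations:
        digests_by_uri[uri].add(body_digest(body))
    return {u: ds for u, ds in digests_by_uri.items() if len(ds) > 1}


aliases = find_aliases(OBSERVATIONS)
multi = find_multi_response_uris(OBSERVATIONS)
```

Digests alone cannot tell content negotiation apart from a genuine update of the resource; that would need the request headers as well, and deciding whether a change is "significant" is harder still.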
SiteStory Status Quo
• mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request
  • not for POST, DELETE, etc
  • for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers, configurable
• Header info stored in BerkeleyDB, response body in FS
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
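The rules above can be sketched as a small Python model. It is not the mod_sitestory or SiteStory Web Archive code: `should_archive` and `ToyArchive` are hypothetical names, the in-memory list and dict only stand in for BerkeleyDB and the file system, SHA-256 is an assumed hash, and recording headers for every transaction is likewise an assumption.

```python
import hashlib

ARCHIVED_METHODS = {"GET"}         # not POST, DELETE, etc.
ARCHIVED_STATUS = {200, 302, 303}  # response codes named on the slide


def should_archive(method: str, status: int) -> bool:
    """Decide whether a transaction would be forwarded to the archive."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUS


class ToyArchive:
    """In-memory stand-in: header records and bodies are kept separately."""

    def __init__(self):
        self.records = []  # (uri, headers, digest); BerkeleyDB in SiteStory
        self.bodies = {}   # digest -> body; file system in SiteStory

    def put(self, uri: str, headers: dict, body: bytes) -> bool:
        """Record a transaction; return True if the body was new."""
        digest = hashlib.sha256(body).hexdigest()  # hash choice is an assumption
        is_new = digest not in self.bodies
        if is_new:
            self.bodies[digest] = body  # dedup: identical bodies stored once
        self.records.append((uri, dict(headers), digest))
        return is_new
```

Keeping headers apart from bodies means a repeated response costs one small record, not a second copy of the content.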
SiteStory Use Case
• http://www.dans.knaw.nl
• LANL has been archiving the DANS website (forever)
• ~32 GB since mid-April 2013
• >200k resources
To Appear: TPDL 2013
• SiteStory benchmark with ab & wget
  o ApacheBench (ab): server stress test tool
  o wget: Web page download
    - All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
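For readers without ab at hand, the kind of measurement behind such a benchmark can be approximated in Python. This is a rough stand-in, not the setup used for the paper: it issues sequential GET requests and reports simple statistics, and the names `time_requests` and `summarize` are illustrative.

```python
import statistics
import sys
import time
import urllib.request


def _http_get(url: str) -> None:
    """Fetch url and read the full body, discarding it."""
    with urllib.request.urlopen(url) as response:
        response.read()


def time_requests(url: str, n: int, fetch=_http_get) -> list:
    """Time n sequential requests to url; return elapsed seconds per request."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: list) -> dict:
    """Reduce a list of timings to min, mean, and max."""
    return {
        "min": min(timings),
        "mean": statistics.mean(timings),
        "max": max(timings),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python bench.py URL [N]
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 10
    print(summarize(time_requests(sys.argv[1], count)))
```

Running it once with the module enabled and once with it disabled against the same URL gives a crude on vs. off comparison. Unlike ab it issues no concurrent requests, and unlike `wget -p` it does not fetch embedded resources.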
Re-executed on testbed
[Figure: testbed diagram with labels ws-dl-03.cs.odu.edu, megalodon.lanl.gov, x99, and @AWS]
Testing with ab
Testing with wget
Round Trip Time -- Distributed
Results
• Distributed: Higher variance
• Increased delay due to network
• On vs. Off Comparison still comparable
• Viable solution without crippling service
SiteStory Installation
• Apache module mod_sitestory
  • Option to exclude a list of directories
• SiteStory Web Archive
  • Trivial for existing Tomcat environments
  • Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go!
Or…
SiteStory Testbed
We have a SiteStory Web Archive installed for you!
1. Install and configure mod_sitestory
2. Send an email containing:
   1. Your contact info
   2. Web server IP address
   3. Server domain name used
3. Happy SiteStory’ing!
mailto: [email protected]
http://mementoweb.github.io/SiteStory/