WADL 2013, July 25-26, Indianapolis, IN

SiteStory: Archiving Done Differently
http://mementoweb.github.io/SiteStory/

Martin Klein (@mart1nkle1n, martinklein0815@gmail.com)
Justin F. Brunelle




LANL SiteStory Team: lead developer


Archiving - the traditional way

• Actively crawl the web
  • For example, using Heritrix


Archiving - the traditional way

• Issues with crawler-based archiving:
  • Request can be rejected (robots.txt, user-agent, IP)
  • Can be deceived (geo-location, user-agent)
  • Can be trapped (crawl my calendar!)
  • Requires constant and massive bandwidth
  • Implied timing problem, when to crawl?


Archiving - the traditional way

Timing problem:
• Update 1 viewed but not archived

Timeline:
• t1: R created
• t2: browser visit 1
• t3: crawler visit 1
• t4: R update 1
• t5: browser visit 2
• t6: R update 2


Archiving - the SiteStory way

• Transactional Web archiving
  • Archive accepts HTTP transaction between browser and server


Archiving - the SiteStory way

Timing problem:
• Update 1 viewed and archived

Timeline:
• t1: R created
• t2: browser visit 1
• t3: crawler visit 1
• t4: R update 1
• t5: browser visit 2
• t6: R update 2
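The two timelines in the slides can be replayed in a few lines of code. The sketch below is only an illustration: the event names, the version numbering (0 = created, 1 = update 1, 2 = update 2), and the function `archived_versions` are assumptions made for this example and are not part of SiteStory.

```python
# Timeline from the slides as (tick, event) pairs; ticks stand for t1..t6.
EVENTS = [
    (1, "create"),         # R created
    (2, "browser_visit"),  # browser visit 1
    (3, "crawler_visit"),  # crawler visit 1
    (4, "update"),         # R update 1
    (5, "browser_visit"),  # browser visit 2
    (6, "update"),         # R update 2
]


def archived_versions(events, capture_on):
    """Return the resource versions captured when the event kinds in
    `capture_on` trigger archiving (0 = created, 1 = update 1, ...)."""
    version = -1  # the resource does not exist yet
    captured = set()
    for _, kind in sorted(events):
        if kind in ("create", "update"):
            version += 1
        elif kind in capture_on and version >= 0:
            captured.add(version)
    return captured


# Crawler-based archiving captures only what the crawler happens to see.
crawler_archive = archived_versions(EVENTS, {"crawler_visit"})
# Transactional archiving captures every version served to a browser.
transactional_archive = archived_versions(EVENTS, {"browser_visit"})

print("crawler:", sorted(crawler_archive))
print("transactional:", sorted(transactional_archive))
```

In this replay the crawler holds only the original version, while the transactional archive also holds update 1. Update 2 is in neither archive because nothing requested it before the timeline ends.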


Archiving - the SiteStory way

• Challenges with transactional archiving:
  • To be archived server has to cooperate
  • Transfer data to archive, batch mode or real-time
  • Archive must trust transmission to be authentic
  • Resources from external servers have to be archived out-of-band
  • Deduplication challenges
    • Alias: different URI, same response
    • Conneg: same URI, different response
    • Determine “significant” content change
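The alias and conneg cases can be made concrete with a small sketch that groups observed responses by body hash. Everything here is an assumption for illustration: the function name `find_aliases_and_variants`, the use of SHA-256, and the example.org URIs. Note that a hash alone cannot tell content negotiation apart from an ordinary update of the same URI over time, and it says nothing about whether a change is “significant”.

```python
import hashlib
from collections import defaultdict


def digest(body: bytes) -> str:
    """Hash a response body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(body).hexdigest()


def find_aliases_and_variants(observations):
    """Given (uri, body) pairs, return
    - aliases: groups of different URIs that returned the same body
    - variants: URIs that returned more than one distinct body
    """
    uris_by_digest = defaultdict(set)
    digests_by_uri = defaultdict(set)
    for uri, body in observations:
        d = digest(body)
        uris_by_digest[d].add(uri)
        digests_by_uri[uri].add(d)
    aliases = [sorted(uris) for uris in uris_by_digest.values() if len(uris) > 1]
    variants = sorted(uri for uri, ds in digests_by_uri.items() if len(ds) > 1)
    return aliases, variants


# Hypothetical observations: two URIs for the same page (alias) and one
# URI served in two representations (conneg).
observations = [
    ("http://example.org/", b"<html>home</html>"),
    ("http://example.org/index.html", b"<html>home</html>"),
    ("http://example.org/report", b"<html>report</html>"),
    ("http://example.org/report", b"%PDF report"),
]
aliases, variants = find_aliases_and_variants(observations)
print(aliases)
print(variants)
```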


SiteStory Status Quo

• mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request
  • not for POST, DELETE, etc.
  • for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers, configurable
• Header info stored in BerkeleyDB, response body in FS
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
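The rules on this slide can be sketched as a toy in-memory archive. This is not SiteStory's code: the class `ToyArchive`, its method names, the choice of SHA-256 for hash(body), and the decision to keep one header record per accepted transaction are all assumptions for illustration. A list and a dict stand in for BerkeleyDB and the file system.

```python
import hashlib

ARCHIVED_METHODS = {"GET"}               # not POST, DELETE, etc.
ARCHIVED_STATUS_CODES = {200, 302, 303}  # response codes named on the slide


class ToyArchive:
    """Hypothetical in-memory stand-in for the SiteStory Web Archive."""

    def __init__(self, store_client_ip=False):
        self.store_client_ip = store_client_ip  # configurable, as on the slide
        self.header_records = []  # stands in for the BerkeleyDB header store
        self.bodies = {}          # digest -> body, stands in for the file system

    def record(self, method, uri, status, headers, body, client_ip=None):
        """Store one transaction; return True if it was accepted."""
        if method not in ARCHIVED_METHODS or status not in ARCHIVED_STATUS_CODES:
            return False
        digest = hashlib.sha256(body).hexdigest()  # assumed hash function
        self.bodies.setdefault(digest, body)       # identical bodies stored once
        entry = {
            "uri": uri,
            "status": status,
            "headers": dict(headers),
            "body_digest": digest,
        }
        if self.store_client_ip and client_ip is not None:
            entry["client_ip"] = client_ip
        self.header_records.append(entry)
        return True


archive = ToyArchive()
html_headers = {"Content-Type": "text/html"}
accepted_first = archive.record("GET", "/index.html", 200, html_headers, b"<html>v1</html>")
accepted_again = archive.record("GET", "/index.html", 200, html_headers, b"<html>v1</html>")
rejected_post = archive.record("POST", "/form", 200, {}, b"ok")
rejected_404 = archive.record("GET", "/missing", 404, {}, b"not found")
print(len(archive.header_records), "header records,", len(archive.bodies), "stored body")
```

Two identical GET responses yield two header records but only one stored body, which is the effect of deduplicating on the body hash.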


SiteStory Use Case

• http://www.dans.knaw.nl
• LANL has been archiving the DANS website (forever)
  • ~32 GB since mid April 2013
  • >200k resources


To Appear: TPDL 2013

• SiteStory benchmark with ab & wget
  o ApacheBench (ab): server stress test tool
  o wget: Web page download
    - All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
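The on/off comparison follows a simple pattern: time the same requests with and without the module enabled, then compare mean and spread. The sketch below shows only that pattern and is not the benchmark from the paper. The helper names are assumptions, and `fake_fetch` is a placeholder that does local work instead of a real HTTP request, so the numbers it prints mean nothing.

```python
import statistics
import time


def time_requests(fetch, runs):
    """Call `fetch` `runs` times and return the duration of each call in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return mean and standard deviation of a list of durations."""
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {"mean": statistics.mean(durations), "stdev": stdev}


def fake_fetch():
    """Placeholder for a real HTTP GET against the test server."""
    sum(range(1000))


# In a real run, one series would be measured with mod_sitestory enabled
# and the other with it disabled; here both use the same placeholder.
without_sitestory = summarize(time_requests(fake_fetch, 20))
with_sitestory = summarize(time_requests(fake_fetch, 20))
print("off:", without_sitestory)
print("on: ", with_sitestory)
```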


Re-executed on testbed

[Figure: testbed diagram with labels ws-dl-03.cs.odu.edu, x99, megalodon.lanl.gov, @AWS]


Testing with ab


Testing with wget


Round Trip Time -- Distributed


Results

• Distributed: higher variance
• Increased delay due to network
• SiteStory on vs. off: performance still comparable
• Viable solution without crippling service


SiteStory Installation

• Apache module mod_sitestory
  • Option to exclude a list of directories
• SiteStory Web Archive
  • Trivial for existing Tomcat environments
  • Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go!

Or…


SiteStory Testbed

We have a SiteStory Web Archive installed for you!

1. Install and configure mod_sitestory
2. Send an email containing:
   1. Your contact info
   2. Web server IP address
   3. Server domain name used
3. Happy SiteStory’ing!


http://mementoweb.github.io/SiteStory/
