Lazy Preservation, Warrick, and the Web Infrastructure


Frank McCown

Old Dominion University
Computer Science Department

Norfolk, Virginia, USA

JCDL 2007, Vancouver, BC
June 19, 2007


Outline

• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
  – Caching experiment
  – Reconstruction experiments
  – Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick


Web Infrastructure


Alternative Models of Preservation

• Lazy Preservation
  – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
  – Wait for it to disappear first, then recover a “good enough” version
• Shared Infrastructure Preservation
  – Push your content to sites that might preserve it
• Web Server Enhanced Preservation
  – Use Apache modules to create archival-ready resources


Image credits:
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg


Crawling the Crawlers

[Diagram: in ordinary web crawling, a single repository (Repo) crawls the World Wide Web; in web-repository crawling the direction is reversed: a crawler pulls a site's resources back out of multiple web repositories (Repo1, Repo2, …, Repon).]
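A minimal sketch of web-repository crawling, assuming a hypothetical repository interface (the `FakeRepo` class and its `lookup` method are stand-ins, not Warrick's actual API): query each repository for every URL of the lost site and keep the freshest copy found.

```python
from datetime import datetime

class FakeRepo:
    """Stand-in for a web repository (a search engine cache or the IA).
    lookup() returns (cached_date, content) or None -- an assumed interface."""
    def __init__(self, store):
        self.store = store

    def lookup(self, url):
        return self.store.get(url)

def recover_url(url, repositories):
    """Ask every repository for `url` and return the freshest copy, or None."""
    candidates = []
    for repo in repositories:
        copy = repo.lookup(url)
        if copy is not None:
            candidates.append(copy)
    if not candidates:
        return None                              # lost, as far as the WI knows
    return max(candidates, key=lambda c: c[0])   # prefer the newest cached date

google = FakeRepo({"http://example.com/a": (datetime(2007, 6, 1), "<html>new</html>")})
ia = FakeRepo({"http://example.com/a": (datetime(2007, 1, 1), "<html>old</html>")})
print(recover_url("http://example.com/a", [google, ia])[1])  # the fresher copy
```

Warrick itself also extracts links from each recovered page to discover more URLs of the lost site; that crawl loop is omitted here.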


[Screenshots: a cached image and a cached PDF. The canonical PDF at http://www.fda.gov/cder/about/whatwedo/testtube.pdf appears in different modified forms in the MSN, Yahoo, and Google caches.]


Web-repository Crawler


• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.

• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.

• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/


What Types of Websites Are Lost?

Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.


Outline

• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
  – Caching experiment
  – Reconstruction experiments
  – Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick


Understanding the WI

• How quickly do search engines acquire and purge their caches?

• Do search engines prefer caching one type of resource over another?

• How much overlap is there between the search engines’ caches and IA’s holdings?

• How successfully can we reconstruct a lost website?

• Are some resources more recoverable than others?


Timeline of Web Resource


Web Caching Experiment

• Create 4 websites composed of HTML, PDFs, and images:
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/

• Remove pages each day

• Query Google, MSN, and Yahoo (GMY) every day using identifiers

McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.


Where is the Internet Archive?

• No crawls from Alexa, IA’s provider

• Even if they had crawled us, the content would not be accessible from IA for 6-12 months

• Short-lived web content is likely to be lost for good


2005 Reconstruction Experiment

• Crawl and reconstruct 24 sites of various sizes:

1. Small (1–150 resources)
2. Medium (151–499 resources)
3. Large (500+ resources)

• Perform 5 reconstructions for each website
  – One using all four repositories together
  – Four using each repository separately

• Calculate reconstruction vector for each reconstruction (changed%, missing%, added%)
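A sketch of the reconstruction-vector calculation (my formulation: treating changed% and missing% as fractions of the original site and added% as a fraction of the reconstructed one is an assumption about the exact definition):

```python
def reconstruction_vector(original, recovered, changed):
    """Return (changed%, missing%, added%) for a reconstructed website.
    `original` and `recovered` are sets of resource names; `changed` is the
    subset of recovered resources whose content differs from the original."""
    missing = original - recovered
    added = recovered - original
    return (round(100 * len(changed) / len(original)),
            round(100 * len(missing) / len(original)),
            round(100 * len(added) / len(recovered)))

# Example: resources B and C come back changed, D and F are not found,
# and a stale resource G is recovered that is no longer on the live site.
original = {"A", "B", "C", "D", "E", "F"}
recovered = {"A", "B", "C", "G", "E"}
print(reconstruction_vector(original, recovered, {"B", "C"}))  # (33, 33, 20)
```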


How Much Did We Reconstruct?

[Diagram: a “lost” website with resources A–F and its reconstruction. A and E come back unchanged; B and C come back as modified versions B′ and C′; the link to D is missing; F cannot be found; and an old resource G, no longer part of the live site, is recovered.]

Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G


Reconstruction Diagram

[Diagram: example reconstruction with 50% identical, 33% changed, 17% missing, and 20% added resources.]


Recovery Success by MIME Type


Repository Contributions


2006 Reconstruction Experiment

• 300 websites chosen randomly from Open Directory Project (dmoz.org)

• Crawled and reconstructed each website every week for 14 weeks

• Examined change rates, age, decay, growth, recoverability

McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.


Success of website recovery each week

On average, we recovered 61% of a website on any given week.


Statistics for Repositories


Experiment: Sample Search Engine Caches

• Feb 2006

• Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo

• Randomly selected 1 result from the first 100

• Downloaded each resource and its cached page

• Checked for overlap with the Internet Archive

McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.


Distribution of Top Level Domains


Cached Resource Size Distributions

[Histograms of cached resource sizes; annotated values: 976 KB, 977 KB, 1 MB, and 215 KB.]


Cache Freshness

[Timeline: a resource is crawled and cached (fresh), later changes on the web server (the cached copy becomes stale), then is crawled and cached again (fresh).]

Staleness = max(0, Last-Modified HTTP header − cached date)
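The staleness formula maps directly to code; a minimal sketch using HTTP-style dates:

```python
from email.utils import parsedate_to_datetime

def staleness_days(last_modified, cached_date):
    """Staleness = max(0, Last-Modified - cached date), in whole days.
    Positive means the cached copy predates the current server version."""
    return max(0, (last_modified - cached_date).days)

# The resource changed on the server 16 days after the engine cached it:
lm = parsedate_to_datetime("Tue, 01 May 2007 00:00:00 GMT")
cached = parsedate_to_datetime("Sun, 15 Apr 2007 00:00:00 GMT")
print(staleness_days(lm, cached))  # 16 -> the cached copy is 16 days stale
```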


Cache Staleness

• 46% of resources had a Last-Modified header

• 71% of these also had a cached date

• 16% were at least 1 day stale


Similarity vs. Staleness


How much of the Web is indexed?

Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05):

• Google: 8 billion pages
• Yahoo: 6.6 billion pages
• MSN: 5 billion pages
• Indexable Web: 11.5 billion pages
• Internet Archive: ?


Overlap with Internet Archive



Distribution of Sampled URLs


Problem: the WI currently stores only the client-side representation of a website. Server components (scripts, databases, configuration files, etc.) are not accessible from the WI.


Outline

• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
  – Caching experiment
  – Reconstruction experiments
  – Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick


[Diagram: a web server holds a database, a Perl script, and configuration files that generate dynamic pages. The static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and the generated dynamic pages reach the Web Infrastructure and are recoverable; the server-side database, script, and configuration are not.]


Injecting Server Components into Crawlable Pages

[Diagram: server components are encoded with erasure codes into blocks that are injected into the site's crawlable HTML pages; recovering at least m blocks from the WI is enough to rebuild the components.]
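As a toy illustration of the erasure-code idea, here is a single-XOR-parity (m+1, m) code: split a component into m equal blocks, append one parity block, and any m surviving blocks rebuild the original. (This is not the talk's actual encoding; practical schemes tolerate more than one lost block.)

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, m):
    """Split `data` into m equal-sized blocks and append an XOR parity block.
    Any m of the resulting m+1 blocks suffice to rebuild `data`."""
    assert len(data) % m == 0, "pad data to a multiple of m first"
    size = len(data) // m
    blocks = [data[i * size:(i + 1) * size] for i in range(m)]
    return blocks + [reduce(xor_bytes, blocks)]

def decode(received, m):
    """`received` is the list of m+1 blocks with at most one entry None."""
    if None not in received[:m]:
        return b"".join(received[:m])              # all data blocks survived
    lost = received.index(None)
    survivors = [b for b in received if b is not None]
    received[lost] = reduce(xor_bytes, survivors)  # XOR rebuilds the lost block
    return b"".join(received[:m])

blocks = encode(b"my-config-file-!", 4)   # 4 data blocks + 1 parity block
blocks[2] = None                          # one block was not recoverable
print(decode(blocks, 4))                  # the original bytes come back
```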


Brass: A Queueing Manager for Warrick

• Warrick requires some technical expertise to download, install, and run

• Warrick uses search engine APIs which allow limited requests per IP address (or key)

• Google no longer provides new keys for accessing their API


Thank You

Frank McCown

fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/

[Photo caption: “Can’t wait until I’m old enough to recover a website!”]


Web Repository Characteristics

Type                              | MIME type                     | File ext | Google | Yahoo | Live | IA
HTML text                         | text/html                     | html     | C      | C     | C    | C
Plain text                        | text/plain                    | txt, ans | M      | M     | M    | C
Graphic Interchange Format        | image/gif                     | gif      | M      | M     | M    | C
Joint Photographic Experts Group  | image/jpeg                    | jpg      | M      | M     | M    | C
Portable Network Graphic          | image/png                     | png      | M      | M     | M    | C
Adobe Portable Document Format    | application/pdf               | pdf      | M      | M     | M    | C
JavaScript                        | application/javascript        | js       | M      | M     |      | C
Microsoft Excel                   | application/vnd.ms-excel      | xls      | M      | ~S    | M    | C
Microsoft PowerPoint              | application/vnd.ms-powerpoint | ppt      | M      | M     | M    | C
Microsoft Word                    | application/msword            | doc      | M      | M     | M    | C
PostScript                        | application/postscript        | ps       | M      | ~S    |      | C

C  = Canonical version is stored
M  = Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S = Indexed but not stored


Results

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
