Archiving the Web: why bother?

LA Times (March 2000)


Archiving the Web: why bother?

• “Web sites are an increasingly important part of [an] institution’s digital assets and of [a] country’s information and cultural heritage.” (JISC – April 2002)

• “A lot of history is born digital. This should not be like early television where there is no record.” (Brewster Kahle – May 2002)

Archiving the Web: who bothers?

• Australia

• USA

• Nordic countries: Denmark, Finland, Sweden

• Other countries: UK, France, Japan

• Internet Archive – “Wayback Machine”

Three conferences:

• What’s next for Digital Deposit Libraries? Darmstadt, September 2001

• International Symposium on Web Archiving. Tokyo, January 2002

• DPC Forum: Web-archiving. London, March 2002.

Issues and Questions

• Legal Deposit of Digital Information? – European Union Copyright Directive

• Copyright?

• Open or closed access?

• Selective or comprehensive?

• When in the life cycle? How often?

• Capturing the experience

– Dynamic web sites

Technical challenges

• Embedded external links and executable programs

• Persistent naming and date stamping

• Duplicate control

• Change in content over time

• Surface web vs Deep web

Australia (PANDORA Archive) – NLA http://www.nla.gov.au/pandora

• As yet no legal deposit.

• Mandate for collecting Commonwealth Government publications

• Selective

– Australian e-journals, organisational sites, government publications, ephemera

• Accessible by public

– Catalogued in the NBD (National Bibliographic Database)

Australia (PANDORA Archive)

• ~1700 titles in the Archive (Nov. 2001)

– Growth rate: 40 sites/month

– Regathering: 35 sites/month

• ADRI (Australian Digital Resource Identifier)

– Unique identification scheme

– In-house resolving system (see the sketch below)
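The slides only say that ADRI is a unique identification scheme with an in-house resolving system; the actual identifier syntax and resolver are not described. The following is a minimal, hypothetical sketch of what a persistent-identifier resolver does: it maps an identifier that never changes to whatever storage location currently holds the archived object. The identifier format, registry and URL below are invented for illustration.

```python
# Hypothetical sketch of a persistent-identifier resolver in the spirit of the
# in-house ADRI resolver mentioned on the slide. Identifier syntax, registry
# contents and URLs are invented; this is not the NLA's actual system.

# Registry mapping persistent archive identifiers to current snapshot locations.
REGISTRY = {
    "adri:example-0001": "http://archive.example.org/snapshots/0001/index.html",
}

def resolve(identifier: str) -> str:
    """Return the archived location registered for a persistent identifier.

    Raises KeyError for unknown identifiers so that broken citations are
    detected rather than silently redirected.
    """
    return REGISTRY[identifier]

if __name__ == "__main__":
    print(resolve("adri:example-0001"))
```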

USA (Minerva) - Library of Congress

• (Mapping the Internet Electronic Resources Virtual Archive)

• Open access materials from the Web

• Changes in copyright law under discussion

• Selective inclusion

• Public access

LC/IA Pilot Project – “Election 2000”

• Joint pilot project of the Library of Congress and the Internet Archive

• Objectives:

– Library pilot: selection, collection and cataloguing of web sites; build a prototype access system

– Internet Archive pilot: gain experience in harvesting and archiving sites

• Over 800 websites (150+ selected sites and major sites hyperlinked to/from those sites)

• 2-3 terabytes of data

• Archived daily, August 2000 to January 2001

Denmark http://www.netarchive.dk

• Royal Library, Copenhagen

• Limited legal deposit of electronic publications

– Static, not dynamic publications – finite units

• Access only from workstations at the Royal Library and the State and University Library

• Archiving static websites (monographs, periodicals)

• Server mirrored nightly to the State and University Library, Aarhus

Denmark (Statistics)

• June 2001 - archived 9000 net publications

– 31% monographs, 69% periodicals

– 67.5% public sector/university, 32.5% private sector publications

• Staff resources: 0.5 technical; 0.8 librarian

Sweden (Royal Library)

• Take snapshots of Swedish Web several times/year

– No selection - take everything

– All www pages in Sweden, all articles in e-journals, all Swedish newspapers

– Definition of Sweden: .se, plus .com, .org and .net sites with a Swedish address or telephone number

– Archive only - no public access as yet.

Sweden (Software)

• Uses Whois to identify Swedish sites in non-.se domains

• Harvesting with COMBINE robot software (Univ. of Lund) – a sketch follows below

– Collects pages by automatically following hypertext links

– Also collects pictures and sound

– Fully automatic - no human intervention
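Neither the Whois lookup nor the COMBINE robot itself is reproduced here. The sketch below only illustrates the two ideas on this slide – classifying a non-.se host as “Swedish” from its Whois record, and harvesting pages by automatically following hypertext links – using the Python standard library plus the system `whois` command. Function names and the heuristics are assumptions for illustration, not Kulturarw3’s actual software.

```python
# Illustrative sketch (not the COMBINE robot): decide whether a non-.se host
# looks Swedish via its Whois record, then harvest pages by following links.
# Assumes a Unix `whois` command is installed on the machine.
import subprocess
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

def looks_swedish(host: str) -> bool:
    """Crude Whois-based test: does the registration mention Sweden?"""
    out = subprocess.run(["whois", host], capture_output=True, text=True).stdout
    return "sweden" in out.lower() or "+46" in out   # +46 = Swedish phone prefix

class LinkCollector(HTMLParser):
    """Collect href/src targets so the crawl also picks up images, sound, etc."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def harvest(start_url: str, limit: int = 50) -> dict:
    """Breadth-first crawl from start_url; returns {url: raw bytes}."""
    archive, queue, seen = {}, deque([start_url]), {start_url}
    while queue and len(archive) < limit:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
                ctype = resp.headers.get_content_type()
        except OSError:
            continue                      # unreachable pages are simply skipped
        archive[url] = body
        if ctype == "text/html":          # only HTML is parsed for further links
            parser = LinkCollector()
            parser.feed(body.decode("utf-8", errors="replace"))
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen and urlparse(absolute).scheme in ("http", "https"):
                    seen.add(absolute)
                    queue.append(absolute)
    return archive
```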

Swedish Archive (Kulturarw3) http://www.kb.se/kw3

• Everything associated with an object, together with its metadata, is stored in one file as a multipart MIME object (see the sketch below)

• Name of the file: a 33-character string with a time stamp

• Sept 2001: 110 million files - 3000 Gbytes of data from 97,000 web servers

• Stored on disk and magnetic tapes using Hierarchical Storage Management (HSM)
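The exact Kulturarw3 record layout and its 33-character naming scheme are not given on the slides. The sketch below only shows the general idea of storing a harvested object and its metadata together as one multipart MIME file, using Python’s standard email/MIME machinery; the metadata fields and file-naming convention are invented for illustration.

```python
# Illustrative sketch of bundling a harvested object plus its metadata into a
# single multipart MIME file, as Kulturarw3 is described as doing. The metadata
# fields and the file name are invented; the real archive has its own scheme.
import hashlib
from datetime import datetime, timezone
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def make_record(url: str, content: bytes, content_type: str) -> tuple[str, bytes]:
    """Bundle content and minimal metadata into one multipart MIME message."""
    fetched_at = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    record = MIMEMultipart()
    # First part: metadata about the harvested object.
    metadata = f"url: {url}\ncontent-type: {content_type}\nfetched: {fetched_at}\n"
    record.attach(MIMEText(metadata, "plain"))
    # Second part: the object itself, stored verbatim as an application part.
    record.attach(MIMEApplication(content))
    # Invented file name: timestamp plus a short content digest (not the real scheme).
    digest = hashlib.md5(content).hexdigest()[:16]
    filename = f"{fetched_at}-{digest}.mime"
    return filename, record.as_bytes()

if __name__ == "__main__":
    name, blob = make_record("http://www.example.se/", b"<html>hej</html>", "text/html")
    with open(name, "wb") as fh:
        fh.write(blob)
```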

Swedish Archive (Kulturarw3) (2)

• Prior to July 2002: limited legal deposit (fixed-form e-documents)

• December 2001: Data Inspection Board team confirms the project is illegal. Project suspended.

• July 2002: amendments to Swedish copyright law give the Royal Library the right to collect the Swedish web and to make the archive publicly available.

Finland - National Library

• Follows Swedish approach - only .fi domain initially

• Finnish Copyright Act under revision to permit harvesting web resources

• Uses harvesting software developed in Finland from NEDLIB specification

• Archive metadata

– Uses MD5 checksums for duplicate control, authentication and to create a unique access key (see the sketch below)

– Time-stamped upon retrieval
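The NEDLIB harvester’s metadata handling is only summarised on the slide. The sketch below simply illustrates how a single MD5 checksum can serve the three purposes mentioned – duplicate control, authentication (integrity checking) and a unique access key – together with a time stamp recorded at retrieval. Class and field names are assumptions; this is not the harvester’s code.

```python
# Illustrative use of MD5 checksums for duplicate control and as a unique access
# key, with a time stamp recorded at retrieval (as described for the NEDLIB
# harvester). A sketch only, not the harvester's actual metadata handling.
import hashlib
from datetime import datetime, timezone

class ArchiveIndex:
    def __init__(self):
        # access key (MD5 hex digest) -> metadata record
        self.records = {}

    def add(self, url: str, content: bytes) -> str:
        """Store a harvested document; return its access key.

        If identical content was already archived (same checksum), no new
        copy is recorded -- only the list of URLs pointing at it grows.
        """
        key = hashlib.md5(content).hexdigest()
        record = self.records.get(key)
        if record is None:
            record = {
                "retrieved": datetime.now(timezone.utc).isoformat(),
                "size": len(content),
                "urls": [],
            }
            self.records[key] = record
        if url not in record["urls"]:
            record["urls"].append(url)
        return key

    def verify(self, key: str, content: bytes) -> bool:
        """Authentication check: does the stored checksum still match the content?"""
        return hashlib.md5(content).hexdigest() == key
```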

Finland - Results of current Harvesting Round (1)

• Harvesting round 2001-2002

– Commenced August 2001, completed April 2002

– 9.4 million files from 29 million locations (URLs)

– Compressed data occupied 340 Gbytes of storage

– Stored on a tape robot in national supercomputing centre

– Hardware used: Sun E450 server

Finland - Results of current Harvesting Round (2)

• Finnish experience: “the NEDLIB harvester can deal with any national Web space (except perhaps the USA) with reasonably modest hardware, provided that there is sufficient storage space available somewhere”. (Juha Hakala, leader of the Finnish team)

Nordic Web Archive

• Joint project of the Nordic national libraries

• Not dependent on which harvester is used

– NEDLIB (Finland, Norway, Denmark), COMBINE (Sweden)

• Selected Norwegian search engine (FAST)

• Software

– Converts documents from 100 different MIME types to HTML (see the sketch below)

– Recognises most European languages

• Budget: 260,000 Euros (AUS $475,000)
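The actual converters in the Nordic Web Archive access software are not described here. The sketch below only shows the dispatch pattern such a pipeline might use – a table mapping MIME types to converter functions that produce HTML for indexing – with two invented converters and a fallback; a real system would register on the order of 100 entries.

```python
# Illustrative dispatch from MIME type to an HTML rendering used for indexing,
# in the spirit of the Nordic Web Archive access software. The converters and
# the fallback behaviour are invented for illustration.
import html

def html_passthrough(data: bytes) -> str:
    """HTML documents are indexed as-is."""
    return data.decode("utf-8", errors="replace")

def text_to_html(data: bytes) -> str:
    """Plain text is escaped and wrapped so the indexer sees valid HTML."""
    body = html.escape(data.decode("utf-8", errors="replace"))
    return f"<html><body><pre>{body}</pre></body></html>"

# One entry per supported MIME type; a real system would have ~100 of these.
CONVERTERS = {
    "text/html": html_passthrough,
    "text/plain": text_to_html,
}

def to_indexable_html(mime_type: str, data: bytes) -> str:
    """Convert a harvested document to HTML, or an empty page if unsupported."""
    converter = CONVERTERS.get(mime_type)
    if converter is None:
        return "<html><body><!-- unsupported format, not indexed --></body></html>"
    return converter(data)
```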

“The homogeneous (surface) Web”

• 59.3% - Text/HTML

• 37.9% - Image (GIF, JPEG, PNG)

• 1.7% - PDF

• 1.1% - Other formats

File counts: 1.5 million HTML; 1 million GIF; 550,000 JPEG; 36,500 PDF; 11,800 plain text; 6,000 Word; 5,300 Java; etc.

(Chart panels: Denmark, Finland)

United Kingdom (1)

• British Library

– “Domain.uk” experiment (commenced 2002)

• Select and capture 100 UK websites (2001 election, GM crops)

• Email selected sites for approval

• Revisit every three weeks

• Uses Bluesquirrel Web Whacker software

• Audit change, loss and links over time (see the sketch below)

– Intention to scale up (2004 funding bid)
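The British Library pilot revisited its selected sites every three weeks and audited change and loss over time; the tooling it actually used (Web Whacker) is not shown here. The sketch below is only a minimal illustration of such an audit under assumed inputs: re-fetch each previously captured URL and report pages that have disappeared or whose content hash has changed since the last visit.

```python
# Minimal illustration of auditing change and loss between revisits of a
# selected set of web sites (not the British Library pilot's actual tooling).
import hashlib
import urllib.request

def audit(previous: dict) -> dict:
    """previous maps url -> MD5 hex digest of the content from the last visit.

    Returns a report of URLs that are now unreachable ('lost'), have different
    content ('changed'), or are unchanged since the previous capture.
    """
    report = {"lost": [], "changed": [], "unchanged": []}
    for url, old_hash in previous.items():
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                new_hash = hashlib.md5(resp.read()).hexdigest()
        except OSError:
            report["lost"].append(url)     # page no longer reachable
            continue
        if new_hash != old_hash:
            report["changed"].append(url)  # content differs from last capture
        else:
            report["unchanged"].append(url)
    return report
```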

United Kingdom (2)

• UKOLN Research Project

– Estimates of the size of the .uk domain: 3 million sites, 24 million pages

• Wellcome Library/JISC Archiving Study to find a solution to web archiving

– The “medical web”

– Consultancy awarded March 2002; completion date October 2002

– Draft report August 2002; final report to be disseminated to the community

Germany

• (Die Deutsche Bibliothek)

– Experiments with targeted harvesting

– Two incomplete snapshots 12/2000 and 02/2001

France

• (Bibliothèque de France)

– In 2001: two experiments with small numbers of sites (16 and 100 respectively), including music, video and multimedia

– Unsatisfactory results:

• Unexpected features

• Exceptionally large sites

– Planning a new feasibility study with 2 different robot providers

– Change in legal deposit law proposed in June 2001. Not yet adopted by Parliament.

Japan

• National Diet Library

• WARP (Web Archiving Program)

• Initially selective

• Major changes in Japanese copyright law expected to permit more comprehensive collecting.

Internet Archive (1)

• Founded by Brewster Kahle in 1996 - $15 million from the sale of WAIS

• Non-profit organisation.

– Sponsors include AT&T Research, Compaq, Xerox PARC, Quantum DLT, National Science Foundation.

• Archived web pages from 1996+, movies from 1903 to 1973

• Site has archived over 10 billion pages (Oct. 2001) = more than 100 terabytes

• Growth rate : 10 terabytes/month

Internet Archive (2)

• Complete sweep of the Web every two months

• “Robot exclusions” - many newspapers, individuals, photographers (see the robots.txt sketch below)

• Complete copy of the Archive at the Bibliotheca Alexandrina (April 2002)

• Duplicates in other continents proposed. “Best method of preservation is replication.”

• Copyright? “May be a massive violation of copyright law.” (Lawrence Lessig, Stanford University expert on IP law in Cyberspace)
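“Robot exclusions” refers to sites that opt out of crawling via the Robots Exclusion Protocol (robots.txt). The Internet Archive’s own crawler is not shown here; the sketch below simply checks a site’s robots.txt with Python’s standard urllib.robotparser before fetching a page, with an assumed user-agent string.

```python
# Illustrative robots.txt check (the Robots Exclusion Protocol behind the
# "robot exclusions" mentioned on the slide); not the Internet Archive's crawler.
import urllib.robotparser
from urllib.parse import urlparse, urlunparse

def allowed_to_archive(url: str, user_agent: str = "example-archiver") -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()                      # fetches and parses the site's robots.txt
    except OSError:
        return True                        # robots.txt unreachable: assume allowed
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed_to_archive("http://www.example.com/news/index.html"))
```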

“Wayback Machine” - http://www.archive.org

• Front end to the Internet Archive collection of public web pages

• Includes most image files in the collection

• Launched October 2001

• Fully available to public

• 20,000 users/day; up to 200 queries per second

• Not yet text searchable (URL search only)

• Financial sustainability? (No advertising)

Conclusion

• “We’re not here to test laws. We’re trying to build a world we want to live in. The world without a library is a world without memory, and that would be tragic.” (B. Kahle, October 2001)

• “On the Web, anyone can be a publisher; now there is a library for their work.” (B. Kahle, May 2002)