The Web Archiving Service Tracy Seneca California Digital Library California Digital Library New York University University of North Texas National Digital Information Infrastructure Preservation Program Library of Congress and the Web-at-Risk NDIIPP Project

The Web Archiving Service

Tracy Seneca
California Digital Library

California Digital Library New York University University of North Texas

National Digital Information Infrastructure and Preservation Program
Library of Congress

and the Web-at-Risk NDIIPP Project

Overview

1. Web archiving: what & why

2. Web-at-Risk grant: scope & purpose

3. Web Archiving Service Sample Screens

Web archiving: what & why

“Web Archiving”: Assumptions

• Using automated methods to gather web content

• Building some kind of collection composed of more than one site

• Intent on preserving captured content

• Results are searchable
  – Public access may not be available

How is the material at risk?

• Vulnerability of
  – Digital publications
  – Web publications
  – Government web publications
  – Local government web publications

The Ephemeral Web

Issues Unique to Government and Political Web Documents

• Publication & notification streams

• Elections, political change

• Security vs. freedom of information

• Local agencies often don’t have the resources to archive their own publications

Web-at-Risk grant: scope & purpose

Grant Scope
Jan 2005 – Jun 2009

• Build tools to allow librarians to capture, curate and preserve web-based government and political information.
  – Create topical and event-based archives
  – Capture individual sites and documents

• Assess the impact of these tools on traditional collection development practices.

• Explore web archiving service sustainability.

Project Partners

Web-at-Risk Collections

Beyond the Grant

• Support web archiving for the University of California
  – Enable collaboration across campuses
  – Enable collaboration between librarians and researchers/faculty

Web Archiving Service (WAS)

• Tangible outcome of grant work

• Being developed and released over a series of pilot tests

• Pilot test 5 underway until May 23

• 2008-2009: develop rights management and public access features

WAS Production

• Early summer 2008, Web Archiving Service goes into ‘limited’ production.
  – Available 24/7 to the curators who have taken part in the pilot tests so far

• Expand user community within UC as CDL confirms that WAS infrastructure, user support and training are sufficient.

Web Archiving Service
Workflow and Sample Screens

WAS workflow
Project > Site > Capture > Collection

• Set up a project (usually a topic or event)

• Define the sites to capture

• Run single or multiple captures of each site

• Choose which results to add to a single, searchable collection
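
The Project > Site > Capture > Collection workflow can be pictured as a small data model. This is a minimal sketch only, not the WAS implementation: the class names, fields, example URL and dates are all hypothetical, and no real crawling takes place.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Capture:
    """One crawl of a site on a given day (hypothetical model)."""
    site_url: str
    captured_on: date


@dataclass
class Site:
    """A site the curator wants captured, with its capture history."""
    url: str
    captures: list[Capture] = field(default_factory=list)

    def run_capture(self, on: date) -> Capture:
        # A real service would launch a crawler here; this only records the event.
        capture = Capture(site_url=self.url, captured_on=on)
        self.captures.append(capture)
        return capture


@dataclass
class Project:
    """A topic or event that groups the sites to capture."""
    name: str
    sites: list[Site] = field(default_factory=list)
    collection: list[Capture] = field(default_factory=list)

    def add_site(self, url: str) -> Site:
        site = Site(url=url)
        self.sites.append(site)
        return site

    def add_to_collection(self, capture: Capture) -> None:
        # The curator chooses which capture results enter the collection.
        self.collection.append(capture)


# Usage: one project, one site, two captures, one chosen for the collection.
project = Project(name="Example election archive")
site = project.add_site("http://example.org/")
first = site.run_capture(date(2008, 5, 1))
second = site.run_capture(date(2008, 5, 8))
project.add_to_collection(second)
```

Keeping the collection separate from the capture history mirrors the last step of the workflow: not every capture has to end up in the searchable collection.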

Capture sites individually

Set Frequency

Add metadata (or not)

Sites can be captured in batches

When Capture Finishes

Display Results
(QA capture effectiveness)

Display Results: Overview & Reports

Display Results: Full Text Search

Display Results

Display Results
(metadata)

Create Collection

Build Collection
(add entire captures)

Build Collection

WAS features for analysis

• It’s impossible to know what a web site ‘contains’ until after you capture it!

• Tools for understanding where the data comes from and how it has changed.
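
One way to see how a site has changed is to compare two captures of it. The sketch below is an illustration under stated assumptions, not the method WAS uses: each capture is modelled as a plain mapping from URL to page bytes, and the function name `compare_captures` and the example URLs are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest used to detect changed content."""
    return hashlib.sha256(content).hexdigest()


def compare_captures(
    earlier: dict[str, bytes], later: dict[str, bytes]
) -> dict[str, list[str]]:
    """Classify URLs in the later capture as new, changed or unchanged,
    and list URLs that disappeared since the earlier capture."""
    report: dict[str, list[str]] = {
        "new": [], "changed": [], "unchanged": [], "missing": []
    }
    for url, content in later.items():
        if url not in earlier:
            report["new"].append(url)
        elif digest(content) != digest(earlier[url]):
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    report["missing"] = [url for url in earlier if url not in later]
    # Sort each list so the report is stable regardless of input order.
    for urls in report.values():
        urls.sort()
    return report


# Usage with two tiny, made-up captures of the same site.
earlier = {
    "http://example.org/": b"home v1",
    "http://example.org/report.pdf": b"report",
    "http://example.org/old.html": b"old page",
}
later = {
    "http://example.org/": b"home v2",
    "http://example.org/report.pdf": b"report",
    "http://example.org/new.pdf": b"new publication",
}
report = compare_captures(earlier, later)
```

The "new" list answers the question of which publications appeared since the last capture; the "changed" and "missing" lists hint at how volatile the site is.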

What’s the nature of this content?

What new publications are in this capture?

Build Collection
(Select files from “Compare” screen)

How volatile is this site?
(Not yet available)

Potential

• We can now capture the “chit chat” – the popular reaction to historic events, in ways never before possible.

• How will researchers interact with captured content once it is in an archive?
  – Visualization
  – Text analysis

• What is the potential, beyond simple search and display?

Web Archive Visualization
Doantam Phan – Stanford University

Questions?

Web-at-Risk Wiki

http://wiki.cdlib.org/WebAtRisk

YouTube video: “Web-at-Risk Collections”

[email protected]