Challenges in Web Archiving: Library of Congress Edition. Office of Strategic Initiatives. Abbie Grotke, Web Archiving Team. NDIIPP Partner Meeting, July 21, 2010. All Hands Meeting, March 2010. Library of Congress Web Archiving Program.
Office of Strategic Initiatives
All Hands Meeting, March 2010
Challenges in Web Archiving:
Library of Congress Edition
Abbie Grotke, Web Archiving Team
NDIIPP Partner Meeting, July 21, 2010
Library of Congress Web Archiving Program
p. 1
• 10 years of archiving
• 5 full-time OSI staff on our team, plus 2 contractors, and other IT and Web Services support
• 80+ staff selecting content for our collections: Library Services, Law Library, and Congressional Research Service
• 30+ event and thematic collections
• 12,500+ URLs processed and permissions sent
• 181 TB of content collected
What We Do Pretty Well At This Point
p. 2
• Web archiving workflows and processes have evolved and become more institutionalized
• Improved crawling strategies so we can react more quickly, manage our archive data better, and better serve our customers at LC
• Large-scale contract crawling by Internet Archive
• A move from collection-by-collection crawling to monthly and weekly “crawl buckets”
• Small-scale in-house crawling now available
    • Tests, emergency crawls
What We Do Pretty Well At This Point
p. 3
• Better tools now to more easily manage our team’s work and all data about various activities: nomination, permissions, crawling, quality review, reporting, etc.
• Automation of manual activities to reduce time spent processing URLs for our nominators and our team
Ongoing Challenges
p. 4
• Selection
    • What to select: so many URLs, so little time
    • No full-time selection staff, everyone is busy
• Quality Review
    • Training to involve Nominators more in the process – “Did we get what you wanted us to get?”
• Team Resources:
    • 14 web archive projects actively crawling
    • Testing our bandwidth
Ongoing Challenges
p. 5
• Legal
    • Permissions: still only about 50% response rate
• Access for Researchers
• Harvesting:
    • Collection of specific types of content: rapidly changing news content, YouTube
    • Training Nominators re: frequency of collection
    • Ramping up in-house crawling (Can we? Should we?)
• The Data:
    • How do we transfer this content easily, both from IA and within LC?
    • How do we manage it, store it, and preserve it?
More Information
p. 6
• Web Archiving Team Public Page (about the activity):
http://www.loc.gov/webarchiving/
• Library of Congress Web Archives (our collections):
http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
• Digital Preservation Video on Web Archiving:
http://www.digitalpreservation.gov/videos/webarch09/index.html
• Contact: Abbie Grotke, Web Archiving Team