9:00am – Welcome/Setting the Agenda for the Day
9:10am - 10:30am – Challenges of the Web Now & in the Future
Response to these Challenges
10:30am – BREAK
11:00am - 12:30pm – Intro to Metadata extraction, Data Mining & the Web Archiving Lifecycle
12:30pm – LUNCH
1:30pm - 3pm – Data Mining Breakout sessions/Deep Dives
3pm – BREAK
3:15pm - 4:30pm – Data Mining Breakout sessions/Deep Dives
4:30pm – Wrap-up & Next Steps
Welcome/Agenda
IIPC GA Meeting Ljubljana, Slovenia April 26, 2013
Data Mining & Web Archiving ‘Lifecycles’
Kris Carpenter Negulescu
Internet Archive
Use Cases
• Election 2012 Collaborative
• NLNZ 2013 Domain and GOV Collections
• Wide00002/00005 Crawls
http://home.us.archive.org/~vinay/wide/wide-00002.html
http://home.us.archive.org/~vinay/wide/wide-00005.html
https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl
IIPC General Assembly, The Hague, May 9, 2011
Traditional “Crawl” Lifecycles
CDXs/WATs
WARCs
Lucene Shards
Analyzing Scope & Quality
Preparing to Collect/Scoping/Framing a Crawl/Collection
Pre “Crawl” Workflows
Target identification (beyond curatorial selection…)
• Automated Filtering of Data Sources by Topic, Geo IP, file format, robots policy or other criteria
• Out-link analyses and ranking from selected sources, In-link analyses
• Mining Anchor text/Page Descriptions/Title tags (if not full text)
“Test” Capture Analyses (…routing to proper capture mechanisms)
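
As a rough illustration of the out-link analysis and ranking idea above, the sketch below ranks hosts outside a seed list by how many distinct seed hosts link to them. The function name `rank_outlink_hosts`, the `(source_url, target_url)` input shape, and the example hosts are all assumptions made for illustration; they are not taken from any Internet Archive tool.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_outlink_hosts(links, seed_hosts):
    """Rank non-seed hosts by the number of distinct seed hosts linking to them.

    `links` is an iterable of (source_url, target_url) pairs (hypothetical shape).
    """
    sources_by_target = defaultdict(set)
    for source_url, target_url in links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Skip links without a host on either side (e.g. mailto: targets).
        if source_host is None or target_host is None:
            continue
        if source_host in seed_hosts and target_host not in seed_hosts:
            sources_by_target[target_host].add(source_host)
    # Most-linked candidates first; ties broken alphabetically by host.
    return sorted(
        ((host, len(sources)) for host, sources in sources_by_target.items()),
        key=lambda item: (-item[1], item[0]),
    )


links = [
    ("http://seed-a.example/page1", "http://candidate.example/"),
    ("http://seed-b.example/news", "http://candidate.example/about"),
    ("http://seed-a.example/page2", "http://other.example/"),
    ("http://seed-a.example/page3", "http://seed-b.example/"),
    ("http://seed-a.example/page4", "mailto:someone@example.org"),
]
ranked = rank_outlink_hosts(links, {"seed-a.example", "seed-b.example"})
print(ranked)
```

Counting distinct linking hosts rather than raw links keeps a single link-heavy seed from dominating the ranking. Other signals (anchor text, Geo IP, robots policy) would be applied as separate filters.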
Your Browser: Behind the Scenes
Extracted Metadata & Links (WAT)
• WAT is WARC ☺
• WAT records are WARC metadata records
• WARC-Refers-To header identifies original WARC record
• WAT payload is JSON
• Can be combined with curator-generated metadata
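
To make the points above concrete, here is a minimal sketch that splits one simplified WAT record into its WARC headers and JSON payload, then reads the `WARC-Refers-To` header and the extracted links. The sample record is hand-written for illustration. The JSON key names (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`) follow a commonly seen WAT layout but should be treated as assumptions, and a real reader would honor `Content-Length` rather than splitting on the first blank line.

```python
import json

# Hand-written, simplified record; real WAT records carry more headers.
SAMPLE_WAT_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Target-URI: http://example.org/\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": '
    '{"HTML-Metadata": {"Head": {"Title": "Example"}, '
    '"Links": [{"path": "A@/href", "url": "http://example.org/about"}]}}}}}'
)


def parse_wat_record(record_text):
    """Return (version_line, headers_dict, parsed_json_payload)."""
    header_block, _, payload = record_text.partition("\r\n\r\n")
    header_lines = header_block.split("\r\n")
    headers = {}
    for line in header_lines[1:]:
        # Split on the first colon only, so URL values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return header_lines[0], headers, json.loads(payload)


version, headers, doc = parse_wat_record(SAMPLE_WAT_RECORD)
html_meta = doc["Envelope"]["Payload-Metadata"]["HTTP-Response-Metadata"]["HTML-Metadata"]
link_urls = [link["url"] for link in html_meta.get("Links", [])]

print(headers["WARC-Type"])       # record type of the WAT record itself
print(headers["WARC-Refers-To"])  # ID of the original WARC record
print(link_urls)
```

Because the payload is plain JSON, curator-supplied fields can be merged into the parsed dictionary without touching the original WARC record.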
Monitoring/Enhancing/Confirming Capture
• Comparing Live Resources to Files Written
• Evaluating Completeness (at all levels)
• Generating Snapshots of Live and Archived resources
• Eliminating Spam/Detecting Scoping Mistakes & Issues
• Mining Crawl Logs (HIVE)
• Mining Browser Logs
• Mining/Analyzing Links
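
One simple way to approach the completeness check listed above is to compare the URLs a crawl was expected to capture against the URLs recorded in a CDX index. The sketch below assumes the common space-separated 11-field CDX layout, with the original URL in the third field and the HTTP status in the fifth. The helper names and sample lines are hypothetical, and other CDX field orders would need different indexes.

```python
def captured_urls(cdx_lines, ok_statuses=("200",)):
    """Collect original URLs from CDX lines whose status counts as captured."""
    urls = set()
    for line in cdx_lines:
        # Skip blank lines and the " CDX N b a m s ..." header line.
        if not line.strip() or line.lstrip().startswith("CDX "):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        original_url, status = fields[2], fields[4]
        if status in ok_statuses:
            urls.add(original_url)
    return urls


def completeness(expected_urls, captured):
    """Return (sorted missing URLs, fraction of expected URLs captured)."""
    expected = set(expected_urls)
    if not expected:
        return [], 1.0
    missing = sorted(expected - captured)
    return missing, 1 - len(missing) / len(expected)


cdx_lines = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20130426000000 http://example.org/ text/html 200 AAAA - - 1234 0 sample.warc.gz",
    "org,example)/missing 20130426000001 http://example.org/missing text/html 404 BBBB - - 512 1234 sample.warc.gz",
]
expected = [
    "http://example.org/",
    "http://example.org/missing",
    "http://example.org/style.css",
]
missing, ratio = completeness(expected, captured_urls(cdx_lines))
print(missing, round(ratio, 2))
```

The expected list could come from a live-site link extraction or a browser log. The same comparison works at the host or collection level by swapping the URL sets for host sets.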
Characterizing/Documenting/Preserving Captures & Collections
Enabling Access & Research
Host profiles
Link Graphs, Tag Clouds, & Visualizations
Collection Based: http://home.us.archive.org/~vinay/eot08-explore-data.html
Archive wide: http://home.us.archive.org/~vinay/global/1995-2011/stats.html
http://home.us.archive.org/~vinay/tld.html
Site/Page Evolution http://archive.org/details/TheNewYorkTimesTimelapse1996-2010
Portal Browse/Search http://eotarchive.cdlib.org/
Research Use/Access: History Tracker (Weber/Lazer), ARCLink (AlSum/Nelson)
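
The host profiles mentioned in the list above could, in their simplest form, be per-host summaries of capture counts, MIME types, and status codes. The sketch below is one possible shape, assuming captures arrive as `(url, mime_type, status)` tuples. That input shape and the name `build_host_profiles` are illustrative assumptions, not the format used for the linked reports.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def build_host_profiles(captures):
    """Aggregate (url, mime_type, status) tuples into per-host summaries."""
    profiles = defaultdict(
        lambda: {"captures": 0, "mime_types": Counter(), "statuses": Counter()}
    )
    for url, mime_type, status in captures:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip entries without a parseable host
        profile = profiles[host]
        profile["captures"] += 1
        profile["mime_types"][mime_type] += 1
        profile["statuses"][status] += 1
    return dict(profiles)


captures = [
    ("http://example.org/", "text/html", "200"),
    ("http://example.org/logo.png", "image/png", "200"),
    ("http://other.example/gone", "text/html", "404"),
]
profiles = build_host_profiles(captures)
for host, profile in sorted(profiles.items()):
    print(host, profile["captures"], dict(profile["mime_types"]))
```

At archive scale the same aggregation would more plausibly run as a Pig or Hive job over CDX or WAT data; the in-memory version only shows the grouping logic.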
HistoryTracker Tool
Beta Version!
PIG Scripts in Hadoop Environment
RU High-Speed Computing Cluster
Link Lists
Curated Data Sets