9:00am – Welcome/Setting the Agenda for the Day

9:10am - 10:30am – Challenges of the Web Now & in the Future

Response to these Challenges

10:30am – BREAK

11:00am - 12:30pm – Intro to Metadata extraction, Data Mining & the Web Archiving Lifecycle

12:30pm – LUNCH

1:30pm - 3pm – Data Mining Breakout sessions/Deep Dives

3pm – BREAK

3:15pm - 4:30pm – Data Mining Breakout sessions/Deep Dives

4:30pm – Wrap-up & Next Steps

Welcome/Agenda

IIPC GA Meeting Ljubljana, Slovenia April 26, 2013

Data Mining & Web Archiving ‘Lifecycles’

Kris Carpenter Negulescu

Internet Archive

Use Cases

• Election 2012 Collaborative

• NLNZ 2013 Domain and GOV Collections

• Wide00002/00005 Crawls

http://home.us.archive.org/~vinay/wide/wide-00002.html

http://home.us.archive.org/~vinay/wide/wide-00005.html

https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl

IIPC General Assembly, The Hague, May 9, 2011

Traditional “Crawl” Lifecycles

CDXs/WATs

WARCs

Lucene Shards

Analyzing Scope & Quality

Preparing to Collect/Scoping/Framing a Crawl/Collection

Pre “Crawl” Workflows

Target identification (beyond curatorial selection…)

• Automated Filtering of Data Sources by Topic, Geo IP, file format, robots policy or other criteria

• Out-link analyses and ranking from selected sources, In-link analyses

• Mining Anchor text/Page Descriptions/Title tags (if not full text)
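The target-identification steps above (topic filtering, link analysis and anchor-text mining) can be sketched roughly as follows. This is only an illustration under stated assumptions: the link triples, the `rank_targets` helper and the keyword list are hypothetical and do not come from the slides, and a real workflow would read links from extracted metadata rather than an in-memory list.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical (source_url, target_url, anchor_text) triples, e.g. taken
# from a small test crawl. All values are made up for illustration.
LINKS = [
    ("http://seed-a.example/", "http://news.example/election", "Election 2012 results"),
    ("http://seed-b.example/", "http://news.example/election", "election coverage"),
    ("http://seed-a.example/", "http://shop.example/deals", "cheap deals"),
    ("http://seed-b.example/", "http://blog.example/vote", "why I vote"),
]


def rank_targets(links, keywords):
    """Rank target hosts by the number of distinct source hosts linking to
    them, counting only links whose anchor text mentions a keyword."""
    seen = set()
    counts = Counter()
    for source, target, anchor in links:
        text = anchor.lower()
        if not any(k in text for k in keywords):
            continue  # topic filter on anchor text
        pair = (urlsplit(source).hostname, urlsplit(target).hostname)
        if pair in seen:
            continue  # count each source host once per target host
        seen.add(pair)
        counts[pair[1]] += 1
    return counts.most_common()


ranked = rank_targets(LINKS, keywords=("election", "vote"))
for host, in_links in ranked:
    print(host, in_links)
```

Counting distinct source hosts rather than raw links is one simple way to keep a single link-heavy site from dominating the ranking; other weightings are equally plausible.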

“Test” Capture Analyses (…routing to proper capture mechanisms)

Your Browser: Behind the Scenes

Extracted Metadata & Links (WAT)

• WAT is WARC ☺

• WAT records are WARC metadata records

• WARC-Refers-To header identifies original WARC record

• WAT payload is JSON

• Can be combined with curator-generated metadata
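As a rough sketch of the WAT points above, the snippet below reads the WARC-Refers-To header and the JSON payload from one record and merges in curator metadata. Several things here are assumptions for illustration: the sample record is hand-built and simplified (for instance, it omits Content-Length), the nested JSON key layout is assumed rather than taken from the slides, and `parse_wat_record` and the curator fields are hypothetical names.

```python
import json


def parse_wat_record(raw):
    """Split one WARC-style record into (headers dict, decoded JSON payload)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.0"
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, json.loads(body)


# Hypothetical payload; the key layout is an assumption for illustration.
payload = json.dumps({
    "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
        "Head": {"Title": "Example"},
        "Links": [{"path": "A@/href", "url": "http://example.org/about"}],
    }}}}
})

# Simplified, hand-built sample record (not a complete WARC record).
sample = "\r\n".join([
    "WARC/1.0",
    "WARC-Type: metadata",
    "WARC-Target-URI: http://example.org/",
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>",
    "Content-Type: application/json",
    "",
    payload,
])

headers, wat = parse_wat_record(sample)
html_meta = (wat["Envelope"]["Payload-Metadata"]
             ["HTTP-Response-Metadata"]["HTML-Metadata"])

# Hypothetical curator-generated metadata, combined with the extracted fields.
curator = {"collection": "Election 2012", "subject": "politics"}
combined = {
    "refers_to": headers["WARC-Refers-To"],  # ID of the original WARC record
    "title": html_meta["Head"]["Title"],
    "links": [link["url"] for link in html_meta["Links"]],
    **curator,
}
print(combined)
```

For real archives, an established WARC reading library is likely a better choice than hand-parsing; the manual split is used here only to keep the sketch self-contained.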

Monitoring/Enhancing/Confirming Capture

• Comparing Live Resources to Files Written

• Evaluating Completeness (at all levels)

• Generating Snapshots of Live and Archived resources

• Eliminating Spam/Detecting Scoping Mistakes & Issues

• Mining Crawl Logs (HIVE)

• Mining Browser Logs

• Mining/Analyzing Links
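To make the crawl-log mining item above more concrete, here is a minimal sketch in plain Python, swapped in for the Hive queries the slide names. It assumes a whitespace-delimited, Heritrix-style log whose first fields are timestamp, status code, size, URI and discovery path; that field order, the sample lines and the `summarize` helper are all assumptions for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Made-up sample lines. Assumed field order:
# timestamp, status, size, URI, discovery path, referrer, MIME type.
LOG = """\
2013-04-26T09:00:01.000Z 200 5120 http://example.org/ - - text/html
2013-04-26T09:00:02.000Z 404 312 http://example.org/missing L http://example.org/ text/html
2013-04-26T09:00:03.000Z 200 2048 http://calendar.example/2013/04 L http://example.org/ text/html
2013-04-26T09:00:04.000Z 200 2048 http://calendar.example/2013/05 LL http://calendar.example/2013/04 text/html
2013-04-26T09:00:05.000Z 200 2048 http://calendar.example/2013/06 LLL http://calendar.example/2013/05 text/html
"""


def summarize(log_text):
    """Tally status codes per host and track the deepest discovery path seen."""
    status_by_host = defaultdict(Counter)
    max_depth = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        status, uri, path = fields[1], fields[3], fields[4]
        host = urlsplit(uri).hostname
        status_by_host[host][status] += 1
        depth = 0 if path == "-" else len(path)  # "-" marks a seed
        max_depth[host] = max(max_depth[host], depth)
    return status_by_host, max_depth


status_by_host, max_depth = summarize(LOG)
for host, statuses in status_by_host.items():
    print(host, dict(statuses), "max depth:", max_depth[host])
```

Hosts with many error responses or unusually deep discovery paths are candidates for a closer look as possible scoping mistakes or crawler traps; the thresholds would depend on the collection.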

Characterizing/Documenting/Preserving Captures & Collections

Enabling Access & Research

Host profiles

Link Graphs, Tag Clouds, & Visualizations

Collection Based: http://home.us.archive.org/~vinay/eot08-explore-data.html

Archive wide: http://home.us.archive.org/~vinay/global/1995-2011/stats.html

http://home.us.archive.org/~vinay/tld.html

Site/Page Evolution http://archive.org/details/TheNewYorkTimesTimelapse1996-2010

Portal Browse/Search http://eotarchive.cdlib.org/

Research Use/Access: History Tracker (Weber/Lazer), ARCLink (AlSum/Nelson)

HistoryTracker Tool

Beta Version!

PIG Scripts in Hadoop Environment

RU High-Speed Computing Cluster

Link Lists

Curated Data Sets
