
9:00am – Welcome/Setting the Agenda for the Day

9:10am - 10:30am – Challenges of the Web Now & in the Future

Response to these Challenges

10:30am – BREAK

11:00am - 12:30pm – Intro to Metadata extraction, Data Mining & the Web Archiving Lifecycle

12:30pm – LUNCH

1:30pm - 3pm – Data Mining Breakout sessions/Deep Dives

3pm – BREAK

3:15pm - 4:30pm – Data Mining Breakout sessions/Deep Dives

4:30pm – Wrap-up & Next Steps

Welcome/Agenda

IIPC GA Meeting Ljubljana, Slovenia April 26, 2013


Data Mining & Web Archiving ‘Lifecycles’

Kris Carpenter Negulescu

Internet Archive


Use Cases

• Election 2012 Collaborative

• NLNZ 2013 Domain and GOV Collections

• Wide00002/00005 Crawls

http://home.us.archive.org/~vinay/wide/wide-00002.html

http://home.us.archive.org/~vinay/wide/wide-00005.html

https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl

IIPC General Assembly, The Hague, May 9, 2011


Traditional “Crawl” Lifecycles

CDXs/WATs

WARCs

Lucene Shards


Analyzing Scope & Quality


Preparing to Collect/Scoping/Framing a Crawl/Collection

Pre “Crawl” Workflows

Target identification (beyond curatorial selection…)

• Automated Filtering of Data Sources by Topic, Geo IP, file format, robots policy or other criteria

• Out-link analyses and ranking from selected sources, In-link analyses

• Mining Anchor text/Page Descriptions/Title tags (if not full text)

“Test” Capture Analyses (…routing to proper capture mechanisms)
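
The out-link analysis and ranking step listed above could be sketched roughly as follows. This is only an illustration, not the workflow used for any of the crawls mentioned here: the input format (pairs of source and target URLs), the function names, and the sample hosts are all hypothetical. It ranks hosts that are not yet in scope by how many distinct curator-selected seed hosts link to them.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def rank_by_inlinks(links, seed_hosts):
    """Rank non-seed hosts by the number of distinct seed hosts linking to them.

    links: iterable of (source_url, target_url) pairs (hypothetical input format).
    seed_hosts: set of hosts already selected by curators.
    """
    linking_seeds = defaultdict(set)
    for source, target in links:
        src, dst = host_of(source), host_of(target)
        # Keep only out-links from seed hosts to hosts that are not yet in scope.
        if src in seed_hosts and dst and dst not in seed_hosts:
            linking_seeds[dst].add(src)
    # Most-linked candidates first; ties are broken alphabetically.
    return sorted(
        ((host, len(srcs)) for host, srcs in linking_seeds.items()),
        key=lambda item: (-item[1], item[0]),
    )


if __name__ == "__main__":
    seeds = {"a.example", "b.example"}
    sample_links = [  # made-up data for illustration
        ("http://a.example/", "http://news.example/story"),
        ("http://b.example/page", "http://news.example/"),
        ("http://a.example/", "http://blog.example/post"),
        ("http://a.example/", "http://b.example/"),         # seed to seed: ignored
        ("http://other.example/", "http://blog.example/"),  # non-seed source: ignored
    ]
    for host, count in rank_by_inlinks(sample_links, seeds):
        print(host, count)
```

Counting distinct linking seeds rather than raw link counts keeps a single link-heavy page from dominating the ranking; other weightings are equally plausible.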


Your Browser: Behind the Scenes


Extracted Metadata & Links (WAT)

• WAT is WARC ☺

• WAT records are WARC metadata records

• WARC-Refers-To header identifies original WARC record

• WAT payload is JSON

• Can be combined with Curator generated metadata
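
To make the points above concrete, here is a minimal sketch that reads one WAT-style record using only the Python standard library. It assumes an uncompressed record and builds its own sample in memory. The JSON layout (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`) is an assumption modeled on the layout commonly seen in WAT files, not something specified on the slide, and the helper names are hypothetical.

```python
import json


def build_sample_record():
    """Build a made-up WAT-style record: WARC headers plus a JSON payload."""
    payload = json.dumps({
        "Envelope": {
            "Payload-Metadata": {
                "HTTP-Response-Metadata": {
                    "HTML-Metadata": {
                        "Head": {"Title": "Example"},
                        "Links": [{"path": "A@/href",
                                   "url": "http://example.org/about"}],
                    }
                }
            }
        }
    }).encode("utf-8")
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: metadata\r\n"
        "WARC-Target-URI: http://example.org/\r\n"
        "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return header + payload + b"\r\n\r\n"


def parse_warc_record(raw):
    """Split one uncompressed WARC record into (version, headers, block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return lines[0], headers, rest[:length]


def wat_links(block):
    """Return link URLs from a WAT JSON payload (assumed layout)."""
    doc = json.loads(block)
    html = (doc.get("Envelope", {}).get("Payload-Metadata", {})
               .get("HTTP-Response-Metadata", {}).get("HTML-Metadata", {}))
    return [link["url"] for link in html.get("Links", []) if "url" in link]


if __name__ == "__main__":
    version, headers, block = parse_warc_record(build_sample_record())
    print(headers["WARC-Type"])       # a WAT record is a WARC metadata record
    print(headers["WARC-Refers-To"])  # ID of the original WARC record
    print(wat_links(block))
```

For real, gzip-compressed files a dedicated WARC reader library would be the safer choice; the hand-rolled parser here only shows how the headers and the JSON payload relate.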


Monitoring/Enhancing/Confirming Capture

• Comparing Live Resources to Files Written

• Evaluating Completeness (at all levels)

• Generating Snapshots of Live and Archived resources

• Eliminating Spam/Detecting Scoping Mistakes & Issues

• Mining Crawl Logs (HIVE)

• Mining Browser Logs

• Mining/Analyzing Links
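
As a rough illustration of crawl log mining, the sketch below counts fetch status codes per host and flags hosts with a high failure rate. The slide refers to Hive; this uses plain Python instead, purely to keep the example self-contained. The log layout (whitespace-delimited, with the status code in the second field and the URL in the fourth) is an assumption, and the sample lines are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_crawl_log(lines):
    """Count fetch status codes per host.

    Assumed field layout (hypothetical): timestamp, status, size, URL, ...
    """
    per_host = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        status, url = fields[1], fields[3]
        host = (urlsplit(url).hostname or "unknown").lower()
        per_host.setdefault(host, Counter())[status] += 1
    return per_host


def failure_rate(counts):
    """Share of fetches whose status is not a 2xx code."""
    total = sum(counts.values())
    ok = sum(n for status, n in counts.items() if status.startswith("2"))
    return (total - ok) / total if total else 0.0


if __name__ == "__main__":
    sample_log = [  # made-up lines for illustration
        "2013-04-26T09:00:00.000Z 200 5120 http://example.org/ - - text/html",
        "2013-04-26T09:00:01.000Z 404 312 http://example.org/missing L http://example.org/ text/html",
        "2013-04-26T09:00:02.000Z -2 - http://spam.example/a L http://example.org/ unknown",
    ]
    for host, counts in sorted(summarize_crawl_log(sample_log).items()):
        print(host, dict(counts), round(failure_rate(counts), 2))
```

Hosts where nearly every fetch fails are candidates for a closer look as possible spam or scoping mistakes; the threshold for "high" is a judgment call that depends on the crawl.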


Characterizing/Documenting/Preserving Captures & Collections


Enabling Access & Research

Host profiles

Link Graphs, Tag Clouds, & Visualizations

Collection Based: http://home.us.archive.org/~vinay/eot08-explore-data.html

Archive wide: http://home.us.archive.org/~vinay/global/1995-2011/stats.html

http://home.us.archive.org/~vinay/tld.html

Site/Page Evolution http://archive.org/details/TheNewYorkTimesTimelapse1996-2010

Portal Browse/Search http://eotarchive.cdlib.org/

Research Use/Access: History Tracker (Weber/Lazer), ARCLink (AlSum/Nelson)


HistoryTracker Tool


Beta Version!

PIG Scripts in Hadoop Environment

RU High-Speed Computing Cluster

Link Lists

Curated Data Sets