Open Inside: The Open Source Tools that Power Archive-It Archive-It Partners 2009 Gordon Mohr, Internet Archive November 4, 2009

Open Inside: The Open Source Tools that Power Archive-It

Archive-It Partners 2009

Gordon Mohr, Internet Archive

November 4, 2009

Archive-It Unifies Many Tools

Archive-It: managing, designing, monitoring, scheduling, reporting

Integrated Tools: collecting, storing, displaying, searching

Open Source & Standards from IA

• 3 open source software projects
  – Heritrix: collecting
  – Wayback: displaying
  – NutchWAX: searching

• 1 co-developed ISO standard
  – WARC File Format: storing

Open Source from Elsewhere

• Linux

• Apache/Tomcat

• MySQL

• Lucene-Nutch-Hadoop

Why Open Source?

• Open Source Initiative says: “Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.”

• More than access to source code: Right to change, reuse, extend

• Wins:
  – Harmonize formats, practices
  – Avoid duplication of effort
  – Reduce costs

Projects Genesis: 2003

• Internet Archive wanted more control over its own software & collections

• Discussions with national libraries: USA, Canada, UK, France, Iceland, Sweden, Norway, Finland, Denmark, Italy, Australia

• Desire to share tools, formats, experiences; avoid duplicated effort and closed, inflexible tools

• Formed: International Internet Preservation Consortium (IIPC)

http://www.netpreserve.org

Heritrix

What is Heritrix?

Open-source

Extensible

Web-scale

Archival-quality

Web crawling software

http://crawler.archive.org

Heritrix Motivations

• Deeper, specialized, in-house crawling

• Open source
  – Encourage collaboration on features and best practices
  – Avoid duplication of work, incompatibilities

• Archival-quality
  – Perfect copies
  – Keep up with changing web
  – Meet evolving needs of Internet Archive and International Internet Preservation Consortium

Heritrix Overview

• Heritrix means heiress

• Java, modular

• Project website: http://crawler.archive.org
  – News, downloads, documentation, issue-tracking
  – Sourceforge: open source hosting site
    • Source-code control (SVN)
    • Official downloads

• “Lesser” GPL or Apache license: easy reuse

• Outside contributions welcome

Milestones

• 1.0 release in March 2004

• Major releases since:
  – 1.2 new scope options (2004)
  – 1.4 improved memory use (2005)
  – 1.6 remote control (2005)
  – 1.8 scaling (2006)
  – 1.10 protocols, formats, fixes (2006)
  – 1.12 “smart” duplicate reduction (2007)
  – 2.0 “smart” prioritization (2008)
  – 1.14 WARC, performance (2008-2009)

Archive-It Uses Heritrix 1.14.3+

• AKA “1.15.4”

• WARC/1.0

• Many minor fixes

• Same as all contract/national crawls

• Available as developer build

• Will become 1.14.4

Heritrix – future

• Next major release: Heritrix 3.0
  – Crawl configuration by ‘Spring’
  – Scriptable configuration
  – Web-service remote control

• Other upcoming priorities
  – “Smart” continuous/automatic revisits (3.2) (from change detection to prediction)
  – Rich media improvements
  – Spam/trap/mirror suppression
  – Automate ever-larger crawls

Heritrix – more info

• Project website
  – http://crawler.archive.org

• Source code
  – Sourceforge ‘SVN’

• Discussion
  – http://tech.groups.yahoo.com/group/archive-crawler/

• Issues/Bugs
  – http://webarchive.jira.com/browse/HER

• Key IA staff:
  – Steve Sisney, Gordon Mohr

Wayback

What is Wayback?

Open Source

Java

Modular

Scalable

Customizable

Web Archive Access Tool

http://archive-access.sourceforge.net/projects/wayback

Wayback – the beginning

• Inception in 2005
  – Aim: URL-based browsing ‘as if’ at previous dates
  – Contrasts with classic:
    • Open source, diverse installs
    • Java vs. Perl/C
    • Refactored:
      – Many extension points
      – Basis for new features & experiments

• First release: “0.2.0” December 2005
  – Now at 1.4.2 (July 2009)

Wayback Features

• Starting with a URL:
  – See list of captures by date
  – See extension URLs (same site)
  – View a capture

• Once browsing (“replay”):
  – Browse web ‘as it was’
  – Best-match clickthroughs
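
A best-match clickthrough can be pictured as a nearest-date lookup: a link followed during replay resolves to the capture of the target URL closest in time to the page being viewed. The sketch below is an illustration only, not Wayback's code. It assumes captures are identified by 14-digit timestamps (YYYYMMDDhhmmss) kept in a sorted list, and the function name `closest_capture` is hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. 20091104120000


def closest_capture(captures, requested):
    """Return the capture timestamp nearest in time to `requested`.

    `captures` is a sorted list of 14-digit timestamp strings for one URL
    (assumed layout; fixed-width digits sort chronologically).
    """
    if not captures:
        raise ValueError("no captures for this URL")
    # Position where `requested` would be inserted in the sorted list.
    i = bisect_left(captures, requested)
    # Only the neighbours just before and at/after that position can be closest.
    candidates = captures[max(i - 1, 0): i + 1]
    want = datetime.strptime(requested, TS_FORMAT)
    return min(
        candidates,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - want),
    )
```

Called with captures from 2004, 2007 and 2009 and a request dated early 2009, the function returns the 2009 capture, since it is nearer in time than the 2007 one.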

Wayback: Modular Components

• Query User Interface
  – Calendar, Search Engine, XML

• Replay User Interface
  – Archival URL, Timeline, Proxy

• Resource Index
  – CDX, BDB, Remote, Nutch, Aggregated

• Resource Store
  – Local ARC, HTTP 1.1 Remote ARC
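
One way to read the index/store split: the Resource Index answers "which captures exist for this URL, and where are they?", while the Resource Store fetches the bytes from that location. The sketch below illustrates that separation with in-memory stand-ins. Every class, field and function name here is hypothetical and does not come from Wayback's Java API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Capture:
    url: str
    timestamp: str  # 14-digit capture date (assumed layout)
    filename: str   # archive file holding the record
    offset: int     # byte offset of the record within that file


class ResourceIndex(Protocol):
    def lookup(self, url: str) -> list[Capture]: ...


class ResourceStore(Protocol):
    def read(self, capture: Capture) -> bytes: ...


class InMemoryIndex:
    """Stand-in for an index such as CDX or BDB."""

    def __init__(self, captures):
        self._by_url = {}
        for c in captures:
            self._by_url.setdefault(c.url, []).append(c)

    def lookup(self, url):
        # Return captures oldest first.
        return sorted(self._by_url.get(url, []), key=lambda c: c.timestamp)


class InMemoryStore:
    """Stand-in for a local or remote archive-file store."""

    def __init__(self, files):
        self._files = files  # filename -> file contents as bytes

    def read(self, capture):
        # Simplification: return everything from the record's offset onward.
        return self._files[capture.filename][capture.offset:]


def replay_latest(index: ResourceIndex, store: ResourceStore, url: str) -> bytes:
    """Look up the URL in the index, then fetch the newest capture from the store."""
    captures = index.lookup(url)
    if not captures:
        raise KeyError(url)
    return store.read(captures[-1])
```

Because `replay_latest` depends only on the two small interfaces, either side can be swapped (for example a remote index with a local store) without touching the other, which is the kind of flexibility the component list suggests.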

Archive-It Uses Wayback 1.4.2+

• UI customized

• Adds server-side rewriting-mode

• Available from project source-control

• Next major release: 1.6.0

Wayback – more info

• Website
  – http://archive-access.sourceforge.net/projects/wayback/

• Source code
  – Sourceforge ‘SVN’

• Discussion
  – https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

• Issues/Bugs
  – https://webarchive.jira.com/browse/ACC

• Key IA staff:
  – Brad Tofel

NutchWAX

What is NutchWAX?

Open Source

Java

Full-Text Indexing

End-User Querying

for Web Archives

Built on Lucene/Nutch/Hadoop

http://archive-access.sourceforge.net/projects/nutch

NutchWAX Background

• Lucene
  – Open-source Java full-text indexing
  – Popular, mature

• Nutch
  – Extensions to Lucene
  – For web content, access, scale

• Hadoop
  – Spun off from Nutch
  – Inspired by Google’s Map-Reduce

NutchWAX

• Inception in 2005

• Nutch Web Archive eXtensions
  – Utilities for using (W)ARCs as Nutch input
  – Configuration for date dimension
  – Handle repeated URLs

• First release: “0.2.1”, July 2005
  – Now at 0.12.8 (September 2009)

Archive-It Uses NutchWAX 0.12.8

• Latest official release

• Recent changes driven by Archive-It
  – Caching support
  – Index maintenance processes (merging)
  – ‘Reboost’ for reranking

NutchWAX – more info

• Website
  – http://archive-access.sourceforge.net/projects/nutchwax/

• Source code
  – Sourceforge ‘SVN’

• Discussion
  – https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

• Issues/Bugs
  – https://webarchive.jira.com/browse/WAX

• Key IA staff:
  – Aaron Binns

WARC

What is WARC?

IIPC

ISO

Standard

Flexible

Simple

Format for Web Archive Files

http://tinyurl.com/2eusle (drafts)

WARC Overview

• WARC = Web ARChive file format

• Next generation of ARC, called for by IIPC
  – ARC format created by the Internet Archive
  – Over 1PB of ARCs gathered since 1996

WARC Goals

• Store arbitrary metadata (e.g., subject classifier, discovered language, encoding)
• Data compression and record integrity
• Store all control information from the harvesting protocol (e.g., request headers)
• Store the results of data migrations
• Store a duplicate detection event
• Distinguishable from the legacy ARC
• Globally unique record identifiers
• Deterministic handling of long records (e.g., truncation, segmentation)
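
Several of these goals show up directly in how a record is written. The sketch below is a minimal illustration rather than a complete WARC writer. It assembles a WARC/1.0 record with a UUID-based record identifier and a block digest, then compresses the record on its own as a gzip member. The function names and the `extra_fields` parameter are hypothetical, and the digest style (`sha1:` plus Base32) is one common labelling, assumed here for illustration.

```python
import base64
import gzip
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_record(warc_type, target_uri, content_type, block, extra_fields=None):
    """Assemble one uncompressed WARC/1.0 record as bytes."""
    digest = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
    fields = [
        ("WARC-Type", warc_type),
        # Globally unique record identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        # Record integrity: digest of the content block.
        ("WARC-Block-Digest", f"sha1:{digest}"),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    # Extensible header: callers may append further named fields.
    fields.extend((extra_fields or {}).items())
    header = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return header.encode("utf-8") + block + b"\r\n\r\n"


def compress_record(record):
    """Compress a single record as its own gzip member."""
    return gzip.compress(record)
```

Compressing each record separately, rather than the whole file at once, means a reader that knows a record's byte offset can decompress just that record.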

ARC vs. WARC

• Both are a simple sequence of content blocks, each introduced by a small text header

• ARCs only 1-line header + protocol response

• WARCs add:
  – Multi-line header with extensible fields
  – New record types:
    • Request, Response, Resource
    • Metadata, Revisit, Conversion, Warcinfo, Continuation
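
The header difference can be made concrete with two small parsers. This is a sketch under stated assumptions, not a full reader: the ARC parser assumes the five-field, space-separated version 1 header line (URL, IP address, archive date, content type, length), and the WARC parser assumes an uncompressed record with no folded (continued) header lines. Both function names are hypothetical.

```python
def parse_arc_header(line):
    """Split a one-line ARC v1 record header into its five fields."""
    parts = line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError("expected 5 space-separated fields")
    url, ip, date, content_type, length = parts
    return {
        "url": url,
        "ip": ip,
        "date": date,
        "content_type": content_type,
        "length": int(length),
    }


def parse_warc_record(data):
    """Parse one uncompressed WARC record into (version, fields, block)."""
    # The header ends at the first blank line.
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after header")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only, so URI values keep their colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    # Content-Length gives the exact size of the content block.
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]
```

The ARC header is positional, so adding a field changes the layout for every reader. The WARC header is a set of named fields, so a reader can pick out `WARC-Type` to tell a response from a revisit or metadata record and skip any fields it does not recognize.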

What does the future hold?

What does the future hold?

Expand and improve toolset

– Driven by user requests, contributions, sponsors

– Unify access tools

– Verify and improve internationalization

What does the future hold?

Keep up with the web

– New formats, protocols, design techniques

– Content challenges:
  • Deep content
  • Spam
  • Interactive applications / AJAX / Javascript

Thank You

Gordon Mohr

Internet Archive Web Group

[email protected]
