Websphinx & Webgraph Inf 141 Information Retrieval Winter 2008


Page 1: Websphinx & Webgraph

Websphinx & Webgraph

Inf 141

Information Retrieval

Winter 2008

Page 2: Websphinx & Webgraph

Assignment 3

See course webpage for specifications

Due Friday Feb 8th

• Working in groups of 2-3 people
• Email with subject: Inf 141 Team Registration
• Train your group

Each member of your group must be able to run your architecture on their own for Assignment 04.

Quiz next Wednesday

Page 3: Websphinx & Webgraph

Assignment 3

Page 4: Websphinx & Webgraph

Websphinx

• www.cs.cmu.edu/~rcm/websphinx/

• To write your own crawler, extend class Crawler and override shouldVisit() and visit().

– visit(): The page is passed to the crawler's visit() method for user-defined processing.

– shouldVisit(Link l): Callback for testing whether a link should be traversed.

• Default returns true for all links.

• Override for other behaviors.

– http://www.cs.cmu.edu/~rcm/websphinx/doc/index.html
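
The two callbacks above can be illustrated with a small, self-contained sketch. It does not use the WebSPHINX jar: SketchCrawler, SameHostCrawler, CrawlDemo, and the in-memory "web" map are hypothetical stand-ins, assumed here only so the example runs on its own. The real Crawler class works with Link and Page objects and has its own scheduling, so treat this as the shape of the override pattern rather than the library's behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in base class (NOT websphinx.Crawler). It only mimics
// the two callbacks named on the slide so the sketch runs without the jar.
abstract class SketchCrawler {
    // Callback: should this link be traversed? Default accepts every link.
    boolean shouldVisit(String url) { return true; }

    // Callback: user-defined processing of a fetched page.
    abstract void visit(String url, List<String> outLinks);

    // Breadth-first crawl over an in-memory "web" (url -> outgoing links).
    // The seed is always visited; shouldVisit() filters discovered links.
    void run(Map<String, List<String>> web, String seed) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            List<String> links = web.getOrDefault(url, Collections.<String>emptyList());
            visit(url, links);
            for (String next : links) {
                if (shouldVisit(next) && seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
    }
}

// Overrides both callbacks: stay under one URL prefix, record visited pages.
class SameHostCrawler extends SketchCrawler {
    final String hostPrefix;
    final List<String> visited = new ArrayList<>();

    SameHostCrawler(String hostPrefix) { this.hostPrefix = hostPrefix; }

    @Override
    boolean shouldVisit(String url) { return url.startsWith(hostPrefix); }

    @Override
    void visit(String url, List<String> outLinks) { visited.add(url); }
}

class CrawlDemo {
    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://a.example/", Arrays.asList("http://a.example/x", "http://b.example/"));
        web.put("http://a.example/x", Arrays.asList("http://a.example/"));
        SameHostCrawler crawler = new SameHostCrawler("http://a.example/");
        crawler.run(web, "http://a.example/");
        System.out.println(crawler.visited);
    }
}
```

In this sketch the link to b.example is rejected by shouldVisit(), so it never reaches visit(); the back-link to the seed is accepted but skipped because it has already been seen.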

Page 5: Websphinx & Webgraph

Websphinx

• Create an array consisting of your seed set of links
– Look at the Link class

• Links to webpage

• Make a link from a string URL

• Make a link from a start tag and end tag

– Look at the Page class
• Mainly supports automatically parsed HTML pages

• Parsing produces a list of tags, words, an HTML parse tree, links

• Can make pages
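
Building the seed array from string URLs might look like the sketch below. It uses java.net.URI from the standard library as a stand-in for the library's Link class, so it runs without the WebSPHINX jar; SeedSet, fromStrings, and SeedDemo are hypothetical names, and the http/https filter is an assumption about what a reasonable seed set would contain, not something the assignment specifies.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class SeedSet {
    // Build the seed array from string URLs, keeping only absolute
    // http(s) URLs. URI is a stand-in here for the library's Link class.
    static URI[] fromStrings(String... urls) {
        List<URI> seeds = new ArrayList<>();
        for (String s : urls) {
            try {
                URI u = URI.create(s.trim());
                String scheme = u.getScheme();
                boolean web = scheme != null
                        && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
                if (web && u.getHost() != null) {
                    seeds.add(u);
                }
            } catch (IllegalArgumentException e) {
                // Malformed URL string: leave it out of the seed set.
            }
        }
        return seeds.toArray(new URI[0]);
    }
}

class SeedDemo {
    public static void main(String[] args) {
        URI[] seeds = SeedSet.fromStrings(
                "http://www.cs.cmu.edu/~rcm/websphinx/",
                "not a url",
                "ftp://example.org/x");
        for (URI seed : seeds) {
            System.out.println(seed);
        }
    }
}
```

With the real library, each accepted string would instead be turned into a Link, and the resulting array handed to the crawler as its starting points.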

Page 6: Websphinx & Webgraph

Webgraph

• Webgraph is a framework to study the web graph

• Use the ArrayListMutableGraph class
– Mutable graph class based on IntArrayList
– ArrayListMutableGraph(ImmutableGraph g): creates a new mutable graph copying a given immutable graph
– View the ImmutableGraph class

• http://webgraph.dsi.unimi.it/docs/
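
To make the data structure concrete, here is a minimal sketch of a mutable graph whose nodes are the integers 0..n-1, each with a growable list of successors. SketchMutableGraph and GraphDemo are hypothetical stand-ins, not the WebGraph classes: the sketch uses ArrayList<Integer> where the slide says the real class is based on IntArrayList, and its copy constructor copies another sketch graph rather than an ImmutableGraph. It is meant only to show the idea of recording crawled links as arcs between numbered pages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in (NOT the WebGraph class): nodes are 0..n-1 and
// each node keeps a growable list of its successors.
class SketchMutableGraph {
    private final List<List<Integer>> successors = new ArrayList<>();

    SketchMutableGraph(int numNodes) {
        for (int i = 0; i < numNodes; i++) {
            successors.add(new ArrayList<>());
        }
    }

    // Copy constructor: the new graph gets its own successor lists, so
    // later changes to the copy do not affect the original.
    SketchMutableGraph(SketchMutableGraph other) {
        for (List<Integer> list : other.successors) {
            successors.add(new ArrayList<>(list));
        }
    }

    int numNodes() { return successors.size(); }

    // Adds the arc x -> y; returns false if it was already present.
    boolean addArc(int x, int y) {
        if (x < 0 || x >= numNodes() || y < 0 || y >= numNodes()) {
            throw new IllegalArgumentException("node out of range");
        }
        List<Integer> out = successors.get(x);
        if (out.contains(y)) {
            return false;
        }
        out.add(y);
        return true;
    }

    int outdegree(int x) { return successors.get(x).size(); }

    long numArcs() {
        long total = 0;
        for (List<Integer> out : successors) {
            total += out.size();
        }
        return total;
    }
}

class GraphDemo {
    public static void main(String[] args) {
        // Pages 0, 1, 2; page 0 links to pages 1 and 2.
        SketchMutableGraph g = new SketchMutableGraph(3);
        g.addArc(0, 1);
        g.addArc(0, 2);
        SketchMutableGraph copy = new SketchMutableGraph(g);
        copy.addArc(1, 2);
        System.out.println(g.numArcs() + " arcs in original, " + copy.numArcs() + " in copy");
    }
}
```

A crawler would typically assign each distinct URL a node number as it is first seen and add one arc per link it traverses.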

Page 7: Websphinx & Webgraph

Questions?