Websphinx & Webgraph
Inf 141
Information Retrieval
Winter 2008
Assignment 3
See course webpage for specifications
Due Friday Feb 8th
Working in groups of 2-3 people
Email with subject: Inf 141 Team Registration
Train your group
Each member of your group must be able to run your architecture on their own for Assignment 4.
Quiz next Wednesday
Assignment 3
Websphinx
• www.cs.cmu.edu/~rcm/websphinx/
• To write your own crawler, extend class Crawler and override shouldVisit() and visit().
– visit(): The page is passed to the crawler's visit() method for user-defined processing.
– shouldVisit(Link l): Callback for testing whether a link should be traversed.
• Default returns true for all links.
• Override for other behaviors.
– http://www.cs.cmu.edu/~rcm/websphinx/doc/index.html
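The two callbacks above can be illustrated with a minimal sketch. To stay runnable without the WebSphinx jar, it does not use the WebSphinx classes at all: MiniCrawler, its String-based links, and the in-memory "web" map are hypothetical stand-ins that only mirror the shouldVisit()/visit() pattern. The real signatures should be taken from the WebSphinx documentation linked above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the crawler callback pattern; NOT the WebSphinx API.
public class MiniCrawler {
    // Toy "web" (assumed data): each URL maps to the URLs it links to.
    private final Map<String, List<String>> web;
    private final List<String> visited = new ArrayList<>();

    public MiniCrawler(Map<String, List<String>> web) {
        this.web = web;
    }

    // Hook: should this link be traversed? The default accepts every link.
    protected boolean shouldVisit(String url) {
        return true;
    }

    // Hook: user-defined processing of a page. Here it only records the URL.
    protected void visit(String url) {
        visited.add(url);
    }

    // Breadth-first traversal from the seed. The seed itself is always visited;
    // shouldVisit() is consulted only for links discovered on visited pages.
    public List<String> run(String seed) {
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            visit(url);
            for (String next : web.getOrDefault(url, Collections.<String>emptyList())) {
                if (shouldVisit(next) && seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    // Example override: only follow links whose URL starts with the given prefix.
    public static List<String> crawlWithPrefix(Map<String, List<String>> web,
                                               String seed, String prefix) {
        MiniCrawler crawler = new MiniCrawler(web) {
            @Override
            protected boolean shouldVisit(String url) {
                return url.startsWith(prefix);
            }
        };
        return crawler.run(seed);
    }
}
```

Overriding shouldVisit() is what keeps a crawl bounded: with the default (accept everything), the traversal follows every link it discovers.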
Websphinx
• Create an array consisting of your seed set of links
– Look at the Link class
• Links to a webpage
• Make a link from a string URL
• Make a link from a start tag and end tag
– Look at the Page class
• Mainly supports automatically parsed HTML pages
• Parsing produces a list of tags, words, an HTML parse tree, and links
• Can make pages
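Building the seed array from string URLs might look like the sketch below. It uses only java.net.URI from the JDK rather than the WebSphinx Link class, so SeedSet and fromStrings are hypothetical names. With WebSphinx on the classpath, each accepted string would presumably be turned into a Link instead (check the Link class documentation for the exact constructors).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds a seed array from string URLs using only the JDK.
public class SeedSet {
    // Parses each string into a URI and keeps only absolute http/https URLs.
    // Malformed or unsupported entries are skipped rather than aborting the crawl.
    public static URI[] fromStrings(String... urls) {
        List<URI> seeds = new ArrayList<>();
        for (String s : urls) {
            try {
                URI u = new URI(s.trim());
                String scheme = u.getScheme();
                boolean web = scheme != null
                        && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
                if (web && u.getHost() != null) {
                    seeds.add(u);
                }
            } catch (URISyntaxException e) {
                // Not a syntactically valid URI; leave it out of the seed set.
            }
        }
        return seeds.toArray(new URI[0]);
    }
}
```

Validating seeds up front means a single bad entry in the list does not stop the crawler before it starts.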
Webgraph
• Webgraph is a framework to study the web graph
• Use the ArrayListMutableGraph class
– Mutable graph class based on IntArrayList
– Creates a new mutable graph copying a given immutable graph: ArrayListMutableGraph(ImmutableGraph g)
– View the ImmutableGraph class
• http://webgraph.dsi.unimi.it/docs/
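A graph built on integer lists identifies nodes by integer index, so a crawler needs some mapping from URLs to node numbers before it can record links. The sketch below shows one way to do that with plain JDK collections. UrlGraph and its method names are hypothetical and chosen for illustration; it is not the WebGraph API, and the real ArrayListMutableGraph methods should be looked up in the documentation linked above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical adjacency-list graph keyed by URL; NOT the WebGraph API.
public class UrlGraph {
    // URL -> node index, assigned in order of first appearance.
    private final Map<String, Integer> ids = new HashMap<>();
    // successors.get(i) holds the node indices that node i links to.
    private final List<List<Integer>> successors = new ArrayList<>();

    // Returns the node index for a URL, creating a new node if it is unseen.
    public int nodeFor(String url) {
        Integer id = ids.get(url);
        if (id == null) {
            id = successors.size();
            ids.put(url, id);
            successors.add(new ArrayList<Integer>());
        }
        return id;
    }

    // Adds the arc from -> to. Returns false if the arc was already present.
    public boolean addArc(String from, String to) {
        int x = nodeFor(from);
        int y = nodeFor(to);
        List<Integer> out = successors.get(x);
        if (out.contains(y)) {
            return false;
        }
        out.add(y);
        return true;
    }

    public int numNodes() {
        return successors.size();
    }

    // Number of distinct outgoing links recorded for a URL (0 if unknown).
    public int outdegree(String url) {
        Integer id = ids.get(url);
        return id == null ? 0 : successors.get(id).size();
    }
}
```

In a crawler, addArc would typically be called from the page-processing callback, once for each link found on the page being visited.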
Questions?