CM40212: Internet Technology
Julian Padget, Tel: x6971, E-mail: [email protected]
Information Retrieval
Acknowledgements: Baeza-Yates & Ribeiro-Neto, James Davenport, Wikipedia
November 22, 2011
Contents
1 Information Retrieval
2 Pre-processing
3 Indexing
4 Ranking
5 Summary
Information Retrieval
Data vs. Information vs. Knowledge
Information retrieval: useful or relevant information in respect of a particular query

Aim: retrieve information about a subject, not just data that satisfies a query

Precision:
Data: all objects satisfy a well-defined condition (regular expression, relational query)—typically syntactic
Information: most objects have an explicable relationship to the query (natural language)—typically syntactic and semantic

Key metric is relevance

Topics in IR: modelling, document classification and categorization, systems architecture, user interfaces, data visualization, filtering
Precision and Recall
Given I (information request), R (relevant documents), A (answer set), and Ra = R ∩ A (the relevant documents retrieved), then:

Precision is the fraction of retrieved documents that are relevant (i.e. correct results):

Precision = |{relevant} ∩ {retrieved}| / |{retrieved}| = |Ra| / |A|

Recall is the fraction of relevant documents that are retrieved:

Recall = |{relevant} ∩ {retrieved}| / |{relevant}| = |Ra| / |R|

Retrieving everything gives 100% recall, but is useless
Fall-Out
The fraction of non-relevant documents retrieved, out of all non-relevant documents available:

fall-out = |{non-relevant} ∩ {retrieved}| / |{non-relevant}| = |R′ ∩ A| / |R′|
High fall-out is bad, but 0% fall-out is no good either
F-score
A measure of a test's accuracy. Given precision p and recall r as defined above, F1 ∈ [0, 1], where 1 is good and 0 is bad.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall)/(precision + recall)

The general formula for non-negative real β is:

Fβ = (1 + β²) · (precision · recall)/(β² · precision + recall)

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.
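These definitions translate directly into set arithmetic. A minimal sketch (the document ids, relevance judgements, and answer set below are invented for illustration):

```python
def precision(relevant: set, retrieved: set) -> float:
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant: set, retrieved: set) -> float:
    return len(relevant & retrieved) / len(relevant)

def fallout(relevant: set, retrieved: set, collection: set) -> float:
    non_relevant = collection - relevant
    return len(non_relevant & retrieved) / len(non_relevant)

def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

collection = set(range(10))   # document ids 0..9
relevant = {0, 1, 2, 3}       # R: judged relevant for the query
retrieved = {0, 1, 4, 5}      # A: the answer set

p, r = precision(relevant, retrieved), recall(relevant, retrieved)
print(p, r)                   # 0.5 0.5
print(f_beta(p, r))           # F1 = 0.5
print(fallout(relevant, retrieved, collection))  # 2/6 ≈ 0.333...
```

Setting beta=2 gives the recall-weighted F2 measure mentioned above.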
Lexical analysis
Characters → words
Four delicate cases:
1 Digits: generally too vague, but when mixed with text (55BC), or 4x4 (credit card), have meaning
2 Hyphens: state of the art ≡ state-of-the-art, but can be integral (F-14, gilt-edge)
3 Punctuation marks: diacritical marks + punctuation can normally be removed, except in specific circumstances
4 Letter case: usually unimportant except when a noun is identical to a proper noun
General rule: remove but specify exceptions
Stopwords
Frequently occurring words are of little use as discriminators
Called stopwords—normally filtered from index terms
Reduces size of indexing structure
Primary group: articles, prepositions, conjunctions
Secondary group: (some) verbs, adverbs, adjectives
Risks reducing recall, e.g. “to be or not to be”!
Most web search engines use a full text index
Stemming I
Problem: query specifies word w, but only a variant of w is present in a relevant document
Perfect query word to document word matching rare
Partial solution is to substitute respective stems
Stemming (conflation) is the process for reducing inflected words to their stem:
Stem may not be identical to the morphological root
Usually sufficient that related words map to the same stem
Note the approximate nature of the process
Wide range of techniques:
1 Brute force: lookup table associating root forms and inflected forms
2 Suffix Stripping: suite of rules for removing the end of a word, such as:
if the word ends in 'ed', remove the 'ed'
Stemming II

if the word ends in 'ing', remove the 'ing'
if the word ends in 'ly', remove the 'ly'
Porter's Algorithm is best known: a 5-phase process applying carefully ordered sets of rules. For example:

running    → run
ran        → ran
laughing   → laugh
laughs     → laugh
laughed    → laugh
languished → languish
seals      → seal
3 Lemmatisation: first identifies part of speech, then applies usage-specific normalization rules—the first part is difficult
Stemming III

4 Stochastic: "learning" algorithm starting from a table of root forms—decision about which rule or which combination to apply aims to minimize probability of error
5 Hybrid: self-explanatory
6 Affix: processes prefixes as well as suffixes
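A minimal suffix-stripping sketch of rules like those above (the suffix list and the three-letter length guard are my own invention; this is not Porter's algorithm, whose rules are far more carefully ordered and conditioned):

```python
SUFFIXES = ("ing", "ed", "ly", "s")  # checked in this order

def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # keep at least three letters of stem: a crude guard
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "laughed", "quickly", "seals"]:
    print(w, "->", naive_stem(w))
```

Note that "running" comes out as "runn", not "run": a good illustration of the approximate nature of stemming noted above, and of why Porter's rules also deal with doubled consonants.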
Term selection
Bibliographic approach: use a specialist
Automatic selection works well for less critical applications:
Identification of noun groups—carry most of the semantics?
Systematic elimination of other parts of speech
Clustering of adjacent nouns into noun groups—to identify a single concept
Searching content
Small, volatile content: on-line, sequential scanning
Otherwise: build indices to speed up search
Three approaches:
1 Inverted files: best current technology
2 Suffix arrays: good for phrase searches, hard to build + maintain
3 Signature files: old technology, now out-performed by inverted files
Key data structure: trie (from retrieval) or prefix tree
n-ary tree data structure
represents an associative array (hash table)
key is (usually) a string
keys implicit from position
Image retrieved from http://en.wikipedia.org/wiki/Trie 20081116
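A trie can be sketched as nested dictionaries, one level per key character (the representation and the "$" end-of-word marker here are implementation choices of mine, not part of any standard library):

```python
def trie_insert(root: dict, word: str) -> None:
    node = root
    for ch in word:
        node = node.setdefault(ch, {})  # descend, creating nodes as needed
    node["$"] = True                    # mark the end of a stored key

def trie_contains(root: dict, word: str) -> bool:
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node                  # reject bare prefixes

root: dict = {}
for key in ["to", "tea", "ted", "ten", "in", "inn"]:
    trie_insert(root, key)

print(trie_contains(root, "ten"))  # True
print(trie_contains(root, "te"))   # False: a prefix, not a stored key
```

Keys are implicit in the path from the root, as noted above: no node stores its full key.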
Inverted files
Inverted file comprises:
1 Vocabulary—the content, usually stemmed words in this case
2 Occurrences—positions in the file
Example (from [Baeza-Yates and Ribeiro-Neto, 1999]):
0        1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
This is a text. A text has many words. Words are made from letters.
Vocabulary   Occurrences
letters      60
made         50
many         28
text         11, 19
words        33, 40
Then build an inverted index using the trie structure
Space gains from breaking text into blocks
... much more to (re-)discover
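The vocabulary/occurrences table above can be reproduced in a few lines (the hand-picked stopword list exists only to make the output match the example):

```python
import re

text = "This is a text. A text has many words. Words are made from letters."
stopwords = {"this", "is", "a", "has", "are", "from"}

index: dict = {}  # word -> list of 1-based character positions
for m in re.finditer(r"[A-Za-z]+", text):
    word = m.group().lower()
    if word not in stopwords:
        index.setdefault(word, []).append(m.start() + 1)

for word in sorted(index):
    print(word, index[word])
# letters [60]
# made [50]
# many [28]
# text [11, 19]
# words [33, 40]
```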
TF-IDF I
TFIDF = term frequency-inverse document frequency
Measure of the importance of a word to a document in a collection of documents

Importance ↗ ∝ freq(w) ∈ document
Importance ↘ ∝ freq(w) ∈ corpus

Basis for scoring and ranking document relevance in search engines

Term-frequency algorithm: (normalized) number of times term ti occurs in document dj:

tfi,j = ni,j / Σk nk,j

where ni,j is the number of occurrences of ti in dj and the denominator is the number of occurrences of all k terms in dj
TF-IDF II

Inverse document frequency measures the importance of the term: take the logarithm of the ratio of the total number of documents |D| to the number that contain the term:

idfi = log( |D| / |{dj : ti ∈ dj}| )

TF-IDF is then:

tf-idfi,j = tfi,j · idfi

A high weight in TF-IDF is achieved by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents
Thus TF-IDF tends to filter out common terms
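A sketch of the two formulas on a toy corpus (the three documents are invented for illustration):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)          # n_ij / sum_k n_kj

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)    # log(|D| / |{d_j : t_i in d_j}|)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in two of the three documents, so its idf (hence weight)
# is low; "cat" appears in one only, so it scores higher.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```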
The scale of the problem
How many documents?
1994: World Wide Web Worm had an index of 110,000 objects
1997: numerous search engines, up to 100M objects
2000 (predicted): >1B

How many queries?

1994: WWWW got 1500/day
1997: Altavista got 20M/day
2000 (predicted): >100M/day
From Anatomy of a search engine (Brin & Page, 1998) retrieved 20081116
“Google” — a new word? I
I met this woman last night at a party and I came right home and googled her.

2001 N.Y. Times 11 Mar. III. 12/3

From the Oxford English Dictionary's definition of the verb

Definition: Googol

10^100, i.e. 1 followed by one hundred zeros
“Google” — a new word? II
The name "googol" was invented by a child (Dr. Kasner's nine-year-old nephew) who was asked to think up a name for a very big number, namely, 1 with a hundred zeros after it. Oxford English Dictionary

We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines. The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (1998).
How does Google choose what to show?
[Screenshot: the Google UK homepage, iGoogle, http://www.google.co.uk/ig?hl=en, retrieved 22/11/09 18:15]
“I’m feeling lucky” is often right
[Screenshot: Google UK results for the query "julian padget bath", about 24,200 results in 0.34 seconds; the top hit is Julian Padget's home page at www.cs.bath.ac.uk/~jap, retrieved 22/11/09 18:11]
How to decide which pages to choose?
It isn’t luck!
The basic idea is obvious, with hindsight.
Choose the page with more links to it.
A   B
↓ ↘ ↓
C   D
Obviously D is more popular than C
Practice is (much) more complicated
Given:

A   B
↓ ↘ ↓
C   D
↓   ↓
E   F
↓   ↓
G   H

E and F each have only one link to them, but, since D is more popular than C, we should regard F as more popular than E (and H as more popular than G).
Practice is (much) more complicated
And constantly changing.
Given:

A   B
↓ ↘ ↓
C   D
↓ ↙ ↓
E   F
↓   ↓
G   H
Now E is more popular than F
And G is more popular than H
Even though nothing has changed for G itself
Practice is (much much) more complicated
1 The real Web contains (lots of) loops.
2 The real Web is utterly massive — no-one, not even Google, really knows how big
3 The real Web keeps changing.
4 The real Web is commercially valuable, so there areincentives to manipulate it.
The real Web contains loops
Could, in principle, write down a set of (linear) equations for the popularity of each page, which would depend on the popularity of the pages which linked to it, which would depend on the popularity of the pages which linked to it . . .

Then we could solve these equations.

These equations have a name: they are the equations for the principal eigenvector of the connectivity matrix of the Web.

The genius of Brin and Page was to realise that these equations could be solved, and in a distributed and iterative manner. It's known as the "PageRank" algorithm.
Solving these equations is what makes Google work
So it’s not really “I’m feeling lucky”, it’s “I believe ineigenvectors”
Crawling I
Adapted from Ch.20 of [Manning et al., 2008], retrieved 20081116
Objective: collect pages for indexing
Desirable properties:
Robustness: avoid spider traps
Politeness: limited impact on server
Distributed: work across multiple machines
Scalable: capacity increases at least linearly with resources
Performance+efficiency: self-evident
Quality: fetch "useful" pages first
Freshness: operate continuously, but ensure pages are refreshed
Extensible: new data formats, protocols
Process:
1 Initial seed set of URLs
Crawling II

2 Extract hyperlinks from pages and add them to the crawl frontier
3 Process URLs from crawl frontier recursively according to control policies:
A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.

A prototypical architecture:
1 The URL frontier: URLs to (re-)fetch
2 A DNS resolution module to determine from which web-server to get a particular page
3 A fetch module to retrieve the page at a URL
4 A parsing module to extract the text and links
5 A duplicate elimination module (including near duplicates)
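The frontier-driven process can be sketched over an in-memory link graph (the graph is invented; a real crawler would fetch pages over HTTP and apply the politeness and re-visit policies described above):

```python
from collections import deque

web = {             # page -> outgoing links, standing in for fetched HTML
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["seed"],  # a loop, as on the real Web
}

def crawl(seeds):
    frontier = deque(seeds)   # the URL frontier
    seen = set(frontier)      # duplicate elimination
    order = []
    while frontier:
        url = frontier.popleft()        # selection policy: plain FIFO
        order.append(url)               # "fetch" and index the page
        for link in web.get(url, []):   # "parse" out the hyperlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["seed"]))  # ['seed', 'a', 'b', 'c']
```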
PageRank: basics
Inspired by link structure of the web
[Figure: three pages with links A → C and B → C]

pages = nodes and links = edges
forward links = outedges
backlinks = inedges
A and B are backlinks of C

a link from page A to page B is a vote from A to B

highly linked pages are more "important" than pages with few links

backlinks from high-PR pages count more than links from low-PR pages

combination of PR and text-matching techniques results in highly relevant search results
PageRank: definitions
u is a web page
Fu = set of pages u points to
Bu = set of pages pointing to u
c = normalization factor
Nu = |Fu|

R(u) = c Σ v∈Bu R(v)/Nv

R induces a (square) matrix A such that

Au,v = 1/Nu if ∃ edge from u to v, 0 otherwise

Let R be a vector, then the computation is in fact R = cAR, and R is an eigenvector of A with eigenvalue c ⇒ iterative solution from a given starting vector R

Technique fails for cycles, e.g. A → B → C → B—a rank sink
PageRank: algorithm
The cycle problem can be resolved by introducing a damping factor:

R′ = c(AR′ + E) where ||R′||₁ = 1

where E is the initial rank source. Also, rather than the entries in the incidence matrix being 0 or 1, they are in [0, 1] such that each column of A sums to 1.
The algorithm is now an iterative procedure:

R0 ← S                            (1)
loop:                             (2)
  Ri+1 ← ARi                      (3)
  d ← ||Ri||1 − ||Ri+1||1         (4)
  Ri+1 ← Ri+1 + dE                (5)
  δ ← ||Ri+1 − Ri||1              (6)
while δ > ε                       (7)
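A common damped variant of this iteration can be sketched on a tiny invented graph (A → C, B → C, C → A). This is not a line-by-line transcription of the procedure above: it uses a (1 − d)/n source term so the ranks form a probability distribution:

```python
def pagerank(links, d=0.85, eps=1e-10):
    pages = list(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}          # starting vector
    while True:
        new = {}
        for p in pages:
            # sum R(v)/N_v over the backlinks v of p, then damp
            incoming = sum(r[v] / len(links[v]) for v in pages if p in links[v])
            new[p] = (1 - d) / n + d * incoming
        delta = sum(abs(new[p] - r[p]) for p in pages)   # L1 change
        r = new
        if delta < eps:
            return r

links = {"A": ["C"], "B": ["C"], "C": ["A"]}   # A -> C <- B, C -> A
ranks = pagerank(links)
print({p: round(v, 4) for p, v in sorted(ranks.items())})
# C collects votes from A and B, so it ranks highest; ranks sum to 1
```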
PageRank: worked example I
Adapted from http://www.iprcom.com/papers/pagerank/, (not) retrieved 20091122

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1 − d) + d(PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
PageRank: worked example II

1 PR(Ti): each page has a notion of its own importance
2 C(Ti): each page distributes its vote equally between its outgoing links. C(Ti) is the outgoing link count for Ti
3 PR(Ti)/C(Ti) measures the share of the vote A receives from Ti
4 d(· · ·): all votes are summed and damped by 0.85
5 (1 − d) makes the sum of all PageRanks equal to 1

[Figure: a four-page example graph over A, B, C, D; from the update rules in the Perl loop below, the links are A → B, A → C, B → C, C → A and D → C]
PageRank: worked example III

1 a: 0.00000 b: 0.00000 c: 0.00000 d: 0.00000
2 a: 0.15000 b: 0.21375 c: 0.39544 d: 0.15000
3 a: 0.48612 b: 0.35660 c: 0.78721 d: 0.15000
converges after 30+ iterations to
38 a: 1.49011 b: 0.78329 c: 1.57660 d: 0.15000
39 a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
40 a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
main loop in perl (initialisation added here; the slide showed only the loop):

my $damp = 0.85;                    # damping factor d
my $norm = 1 - $damp;               # the (1 - d) term
my ($a, $b, $c, $d) = (0, 0, 0, 0); # initial ranks
my $i = 40;                         # iteration count
while ($i--) {
    printf(
        "a: %.5f b: %.5f c: %.5f d: %.5f\n",
        $a, $b, $c, $d
    );
    $a = $norm + $damp * $c;                  # backlink: C
    $b = $norm + $damp * ($a/2);              # backlink: A (half its vote)
    $c = $norm + $damp * ($a/2 + $b + $d);    # backlinks: A, B, D
    $d = $norm;                               # no backlinks
}
“Google bomb”
PageRank can be manipulated
Create lots of links to one certain destination
Label all of them with the same terms
Query Google for those terms and you will get the linked page

For example:
<a href="http://www.whitehouse.gov/president/gwbbio.html">Miserable Failure</a>
Random vs. Intentional Model
The original PageRank algorithm is grounded in the model of a "random surfer"

PR(page) ≡ probability of reaching this page by following links randomly

Users follow links by intention, so the number of actual visits reflects the importance of a page—the "intentional surfer"

Such information is captured by the Google toolbar

The nofollow attribute was introduced to combat spamdexing—artificial manufacture of large numbers of links to a page to increase its PageRank

Webmasters then use nofollow on outgoing links to increase their own PageRank

Consequence: loss of links for crawling and undermining of the random surfer model

PageRank is currently a combination of random and intentional models
Summary
Information retrieval is a large field—the treatment here is particular and superficial

Careful pre-processing is required to handle the huge volumes of data now accessible—even if large amounts are replicated

Capability to search such quantities of data is a serendipitous consequence of a simple, efficient ranking algorithm combined with the emergent structure of the web

This material does not stand alone, but is closely linked with and inter-dependent with everything else, notably the capacity to distribute the indexing computation.
Resources I
“Never believe one source”
Stemming, see http://en.wikipedia.org/wiki/Stemming, retrieved 20081116

Martin Porter's Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/, retrieved 20081116

But see http://snowball.tartarus.org/algorithms/english/stemmer.html for the improved Snowball algorithm, retrieved 20081116
Chinese Web Information Retrieval Forum:http://www.cwirf.org/
Resources II

Information Retrieval (C.J. van Rijsbergen), on-line book at http://www.dcs.gla.ac.uk/Keith/Preface.html

Introduction to Information Retrieval, on-line book at http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html includes chapters in pdf.

Lucene text retrieval library in Java (and others): lucene.apache.org

The original PageRank paper at: http://infolab.stanford.edu/~backrub/google.html
References I
[Baeza-Yates and Ribeiro-Neto, 1999] was a primary sourcefor these notes. More recent and net-accessible is[Manning et al., 2008].
Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press. ISBN 0-201-39829-X.

Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 0521865719. Also available from http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, retrieved 20081116.