CM40212: Internet Technology
Information Retrieval
Julian Padget, Tel: x6971, E-mail: [email protected]
Acknowledgements: Baeza-Yates & Ribeiro-Neto, James Davenport, Wikipedia
November 22, 2011



Contents

1 Information Retrieval

2 Pre-processing

3 Indexing

4 Ranking

5 Summary


Information Retrieval

Data vs. Information vs. Knowledge

Information retrieval: useful or relevant information in respect of a particular query

Aim: retrieve information about a subject, not just data that satisfies a query

Precision:

Data: all objects satisfy a well-defined condition (regular expression, relational query); typically syntactic
Information: most objects have an explicable relationship to the query (natural language); typically syntactic and semantic

Key metric is relevance

Topics in IR: modelling, document classification and categorization, systems architecture, user interfaces, data visualization, filtering


Precision and Recall

Given I (information request), R (relevant documents), A (answer set), and Ra = R ∩ A (the relevant documents that are retrieved), then:

Precision is the fraction of retrieved documents that are relevant (i.e. correct results):

Precision = |{relevant} ∩ {retrieved}| / |{retrieved}| = |Ra| / |A|

Recall is the fraction of relevant documents that are retrieved:

Recall = |{relevant} ∩ {retrieved}| / |{relevant}| = |Ra| / |R|

Retrieving everything gives 100% recall, but is useless


Fall-Out

The fraction of non-relevant documents retrieved, out of all non-relevant documents available:

fall-out = |{non-relevant} ∩ {retrieved}| / |{non-relevant}| = |R′ ∩ A| / |R′|

High fall-out is bad, but 0% fall-out is no good either


F-score

A measure of a test's accuracy. Given precision p and recall r as defined above, F ∈ [0, 1], where 1 is good and 0 is bad.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F = 2 · (precision · recall) / (precision + recall)

The general formula for non-negative real β is:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.
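The three metrics can be sketched directly from the set definitions; this is an illustrative helper (the function names and the document sets are our own, not from the slides):

```python
# Precision, recall and F-beta over Python sets of document identifiers.
def precision(relevant, retrieved):
    return len(relevant & retrieved) / len(retrieved)   # |Ra| / |A|

def recall(relevant, retrieved):
    return len(relevant & retrieved) / len(relevant)    # |Ra| / |R|

def f_beta(p, r, beta=1.0):
    # F_beta = (1 + beta^2) * p * r / (beta^2 * p + r); beta=1 is F1
    return (1 + beta**2) * p * r / (beta**2 * p + r)

relevant  = {"d1", "d2", "d3", "d4"}   # R: the relevant documents
retrieved = {"d3", "d4", "d5", "d6"}   # A: the answer set
p, r = precision(relevant, retrieved), recall(relevant, retrieved)
# |Ra| = 2, |A| = 4, |R| = 4, so p = 0.5 and r = 0.5
```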


Lexical analysis

Characters → words

Four delicate cases:

1 Digits: generally too vague to index, but meaningful when mixed with text (55BC) or in fixed patterns such as 4×4 digit groups (credit-card numbers)

2 Hyphens: state of the art ≡ state-of-the-art, but the hyphen can be integral (F-14, gilt-edge)

3 Punctuation marks: diacritical marks and punctuation can normally be removed, except in specific circumstances

4 Letter case: usually unimportant, except when a noun is identical to a proper noun

General rule: remove, but specify exceptions


Stopwords

Frequently occurring words are of little use as discriminators

Called stopwords; normally filtered from index terms

Reduces the size of the indexing structure

Primary group: articles, prepositions, conjunctions

Secondary group: (some) verbs, adverbs, adjectives

Risks reducing recall, e.g. “to be or not to be”!

Most web search engines use a full text index
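A minimal sketch of stopword filtering during index-term extraction; the stopword list is a tiny illustrative sample, not a standard one:

```python
# Drop stopwords while extracting index terms from raw text.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "not", "be", "is"}

def index_terms(text):
    """Lowercase, split on whitespace, and discard stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

# The failure case from the slide: every word is a stopword, so the
# query vanishes entirely, which is exactly the recall risk noted above.
index_terms("to be or not to be")    # -> []
index_terms("The quick brown fox")   # -> ['quick', 'brown', 'fox']
```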


Stemming I

Problem: a query specifies word w, but only a variant of w is present in a relevant document

Perfect query-word to document-word matching is rare

A partial solution is to substitute the respective stems

Stemming (conflation) is the process of reducing inflected words to their stem:

the stem may not be identical to the morphological root
it is usually sufficient that related words map to the same stem
note the approximate nature of the process

Wide range of techniques:

1 Brute force: a lookup table associating root forms and inflected forms

2 Suffix stripping: a suite of rules for removing the end of a word, such as:

if the word ends in ’ed’, remove the ’ed’


Stemming II

if the word ends in ’ing’, remove the ’ing’
if the word ends in ’ly’, remove the ’ly’

Porter’s Algorithm is the best known: a 5-phase process applying carefully ordered sets of rules. For example:

running → run
ran → ran
laughing → laugh
laughs → laugh
laughed → laugh
languished → languish
seals → seal

3 Lemmatisation: first identify the part of speech, then apply usage-specific normalization rules; the first part is difficult

CM40212: InternetTechnology

Julian PadgetTel: x6971,E-mail:

[email protected]

InformationRetrieval

Pre-processing

Indexing

Ranking

Summary

Stemming III

4 Stochastic: a “learning” algorithm starting from a table of root forms; the decision about which rule or which combination to apply aims to minimize the probability of error

5 Hybrid: self-explanatory

6 Affix: processes prefixes as well as suffixes
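Technique 2 (suffix stripping) can be sketched in a few lines. The three rules below are the ones quoted on the slides; this is NOT Porter's algorithm, which adds measure conditions and consonant-doubling repair on top:

```python
# Naive rule-based suffix stripping: try each rule in order, apply the
# first that matches.  The length guard stops very short words ("red",
# "bed") from being mangled.
RULES = [("ing", ""), ("ed", ""), ("ly", "")]

def strip_suffix(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

strip_suffix("laughing")   # -> 'laugh'
strip_suffix("laughed")    # -> 'laugh'
strip_suffix("quickly")    # -> 'quick'
strip_suffix("running")    # -> 'runn': shows why Porter needs extra
                           #    rules to recover 'run'
```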


Term selection

Bibliographic approach: use a specialist

Automatic selection works well for less critical applications:

Identification of noun groups, which carry most of the semantics?
Systematic elimination of other parts of speech
Clustering of adjacent nouns into noun groups, to identify a single concept


Searching content

Small, volatile content: on-line, sequential scanning
Otherwise: build indices to speed up search

Three approaches:

1 Inverted files: best current technology
2 Suffix arrays: good for phrase searches, hard to build and maintain
3 Signature files: old technology, now out-performed by inverted files

Key data structure: trie (from retrieval) or prefix tree:

n-ary tree data structure
represents an associative array (hash table)
keys are (usually) strings
keys implicit from position

[Image: example trie, retrieved from http://en.wikipedia.org/wiki/Trie 20081116]
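A minimal trie sketch as an associative array with string keys; the node layout and method names are our own, not from the slides:

```python
# A prefix tree: each node holds a char -> child map, so keys are
# implicit in the path from the root, exactly as the slide notes.
class Trie:
    def __init__(self):
        self.children = {}    # char -> Trie
        self.value = None     # payload stored at terminal nodes
        self.terminal = False

    def insert(self, key, value=None):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.terminal, node.value = True, value

    def lookup(self, key):
        node = self
        for ch in key:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.value if node.terminal else None

t = Trie()
t.insert("tea", 1)
t.insert("ted", 2)   # shares the 't'-'e' path with "tea"
t.lookup("tea")      # -> 1
t.lookup("te")       # -> None: a prefix, not a stored key
```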


Inverted files

An inverted file comprises:

1 Vocabulary: the content, usually stemmed words in this case
2 Occurrences: positions in the file

Example (from [Baeza-Yates and Ribeiro-Neto, 1999]), with 1-based character positions:

This is a text. A text has many words. Words are made from letters.

Vocabulary  Occurrences
letters     60
made        50
many        28
text        11, 19
words       33, 40

Then build an inverted index using the trie structure

Space gains from breaking the text into blocks

... much more to (re-)discover
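The occurrence lists in the table above can be reproduced mechanically; a sketch (the vocabulary set is taken from the example, the rest is our own):

```python
# Build the occurrence lists for the Baeza-Yates & Ribeiro-Neto example.
# Positions are 1-based character offsets, matching the slide's table.
import re

text = "This is a text. A text has many words. Words are made from letters."
vocabulary = {"letters", "made", "many", "text", "words"}

index = {}
for m in re.finditer(r"[A-Za-z]+", text):
    word = m.group().lower()            # fold case: "Words" == "words"
    if word in vocabulary:
        index.setdefault(word, []).append(m.start() + 1)

# index == {'text': [11, 19], 'many': [28], 'words': [33, 40],
#           'made': [50], 'letters': [60]}
```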


TF-IDF I

TF-IDF = term frequency, inverse document frequency

A measure of the importance of a word to a document in a collection of documents

Importance increases with the frequency of w in the document

Importance decreases with the frequency of w in the corpus

Basis for scoring and ranking document relevance in search engines

Term-frequency algorithm: the (normalized) number of times term t_i occurs in document d_j:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of t_i in d_j and the denominator is the number of occurrences of all k terms in d_j


TF-IDF II

Inverse document frequency measures the importance of the term: take the logarithm of the ratio of the total number of documents |D| to the number of documents containing the term:

idf_i = log( |D| / |{d_j : t_i ∈ d_j}| )

TF-IDF is then:

tf-idf_{i,j} = tf_{i,j} · idf_i

A high TF-IDF weight is achieved by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents

Thus TF-IDF tends to filter out common terms
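The two formulas compose directly; a sketch over a toy corpus (the three documents are our own example, not from the notes):

```python
# tf-idf from the definitions above: tf = n_ij / sum_k n_kj,
# idf = log(|D| / df).  Documents are pre-tokenized term lists.
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # document frequency
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

tf_idf("the", docs[0], docs)   # -> 0.0: in every document, log(3/3) = 0
tf_idf("dog", docs[1], docs)   # > 0: rare term, so it keeps its weight
```

Note how the common term is filtered out exactly as the slide predicts: when a term appears in every document, idf is log(1) = 0.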


The scale of the problem

How many documents?

1994: World Wide Web Worm had an index of 110,000 objects
1997: numerous search engines, up to 100M objects
2000 (predicted): >1B

How many queries?

1994: WWWW got 1500/day
1997: Altavista got 20M/day
2000 (predicted): >100M/day

From The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin & Page, 1998), retrieved 20081116


“Google” — a new word? I

I met this woman last night at a party and I came right home and googled her.

2001 N.Y. Times 11 Mar. III. 12/3

From the Oxford English Dictionary’s definition of the verb

Definition: Googol

10^100, i.e. 1 followed by 100 zeros


“Google” — a new word? II

The name “googol” was invented by a child (Dr. Kasner’s nine-year-old nephew) who was asked to think up a name for a very big number, namely, 1 with a hundred zeros after it. (Oxford English Dictionary)

We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines. (The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page, 1998)


How does Google choose what to show?

[Screenshot: the Google UK homepage (iGoogle), http://www.google.co.uk/ig?hl=en, retrieved 22/11/09 18:15]


“I’m feeling lucky” is often right

[Screenshot: Google UK results for the query “julian padget bath”: about 24,200 results in 0.34 seconds; the top hit is Julian Padget’s home page at www.cs.bath.ac.uk/~jap. Retrieved 22/11/09 18:11]


How to decide which pages to choose?

It isn’t luck!

The basic idea is obvious, with hindsight: choose the page with more links to it.

A   B
| \ |
C   D

Obviously D is more popular than C


Practice is (much) more complicated

Given:

A   B
| \ |
C   D
|   |
E   F
|   |
G   H

E and F each have only one link to them, but, since D is more popular than C, we should regard F as more popular than E (and H as more popular than G).


Practice is (much) more complicated

And constantly changing. Given:

A   B
| \ |
C   D
| / |
E   F
|   |
G   H

Now E is more popular than F

And G is more popular than H

Even though nothing has changed for G itself


Practice is (much much) more complicated

1 The real Web contains (lots of) loops.

2 The real Web is utterly massive: no-one, not even Google, really knows how big

3 The real Web keeps changing.

4 The real Web is commercially valuable, so there areincentives to manipulate it.


The real Web contains loops

We could, in principle, write down a set of (linear) equations for the popularity of each page, which would depend on the popularity of the pages which linked to it, which would depend on the popularity of the pages which linked to those, and so on.

Then we could solve these equations.

These equations have a name: they are the equations for the principal eigenvector of the connectivity matrix of the Web.

The genius of Brin and Page was to realise that these equations could be solved, and in a distributed and iterative manner. It’s known as the “PageRank” algorithm.

Solving these equations is what makes Google work

So it’s not really “I’m feeling lucky”, it’s “I believe in eigenvectors”


Crawling I

Adapted from Ch.20 of [Manning et al., 2008], retrieved 20081116

Objective: collect pages for indexing

Desirable properties:

Robustness: avoid spider traps
Politeness: limited impact on servers
Distributed: work across multiple machines
Scalable: capacity increases at least linearly with resources
Performance + efficiency: self-evident
Quality: fetch “useful” pages first
Freshness: operate continuously, but ensure pages are refreshed
Extensible: new data formats, protocols

Process:

1 Initial seed set of URLs


Crawling II

2 Extract hyperlinks from pages and add them to the crawl frontier

3 Process URLs from the crawl frontier recursively according to control policies:

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.

A prototypical architecture:

1 The URL frontier: URLs to (re-)fetch
2 A DNS resolution module to determine from which web-server to get a particular page
3 A fetch module to retrieve the page at a URL
4 A parsing module to extract the text and links
5 A duplicate elimination module (including near duplicates)
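The seed/frontier process above can be sketched as a toy skeleton. To keep it runnable, `fetch` reads a hypothetical in-memory link graph (`WEB`) instead of doing real HTTP; a real crawler would plug DNS resolution, a fetch module, and a parser in at that point:

```python
# Frontier-driven crawl: seed URLs in, visit order out.
from collections import deque

WEB = {  # hypothetical site: page -> links it contains
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],   # note the loop back to a.html
    "d.html": [],
}

def fetch(url):
    """Stand-in for the fetch + parse modules: return extracted links."""
    return WEB.get(url, [])

def crawl(seeds):
    frontier = deque(seeds)          # the URL frontier
    seen = set(seeds)                # duplicate elimination
    order = []
    while frontier:
        url = frontier.popleft()     # selection policy: plain FIFO
        order.append(url)
        for link in fetch(url):      # extract hyperlinks
            if link not in seen:     # loops in the graph are harmless
                seen.add(link)
                frontier.append(link)
    return order

crawl(["a.html"])   # -> ['a.html', 'b.html', 'c.html', 'd.html']
```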


PageRank: basics

Inspired by the link structure of the web

[Diagram: pages A and B each link to page C]

pages = nodes and links = edges
forward links = outedges
backlinks = inedges
A and B are backlinks of C

a link from page A to page B is a vote from A for B

highly linked pages are more “important” than pages with few links

backlinks from high-PR pages count for more than links from low-PR pages

the combination of PR and text-matching techniques results in highly relevant search results


PageRank: definitions

u is a web page
Fu = set of pages u points to
Bu = set of pages pointing to u
c = normalization factor
Nu = |Fu|

R(u) = c · Σ_{v ∈ Bu} R(v)/Nv

R induces a (square) matrix A such that:

A_{u,v} = 1/Nu if there is an edge from u to v, and 0 otherwise

Let R be a vector; then the computation is in fact R = cAR, so R is an eigenvector of A with eigenvalue c ⇒ iterative solution from a given starting vector R

The technique fails for cycles, e.g. A → B → C → B: a rank sink


PageRank: algorithm

The cycle problem can be resolved by introducing a damping factor:

R′ = c(AR′ + E)  where ||R′||1 = 1

where E is the initial rank source. Also, rather than the entries in the incidence matrix being 0 or 1, they are in [0, 1] such that each column of A sums to 1.

The algorithm is now an iterative procedure:

R0 ← S                               (1)
loop:                                (2)
  R_{i+1} ← A·R_i                    (3)
  d ← ||R_i||1 − ||R_{i+1}||1        (4)
  R_{i+1} ← R_{i+1} + d·E            (5)
  δ ← ||R_{i+1} − R_i||1             (6)
while δ > ε                          (7)
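Steps (1)-(7) above translate directly into code. This sketch runs the loop in plain Python over a made-up 3-page link graph (page 0 links to pages 1 and 2, page 1 to 2, page 2 back to 0); A is assumed column-stochastic with no dangling pages, so the correction d stays 0:

```python
# Power-iteration PageRank following the slide's loop (1)-(7).
def pagerank(A, E, eps=1e-8):
    n = len(E)
    R = [1.0 / n] * n                                   # R0 <- S
    while True:
        # R_{i+1} <- A . R_i
        R_next = [sum(A[i][j] * R[j] for j in range(n)) for i in range(n)]
        # d <- ||R_i||_1 - ||R_{i+1}||_1  (rank lost, if any)
        d = sum(R) - sum(R_next)
        # R_{i+1} <- R_{i+1} + d . E
        R_next = [r + d * e for r, e in zip(R_next, E)]
        # delta <- ||R_{i+1} - R_i||_1; stop once below epsilon
        delta = sum(abs(a - b) for a, b in zip(R_next, R))
        R = R_next
        if delta <= eps:
            return R

A = [[0.0, 0.0, 1.0],   # row i, column j: rank flowing from page j to i
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
E = [1/3, 1/3, 1/3]     # uniform rank source
R = pagerank(A, E)      # converges to roughly [0.4, 0.2, 0.4]
```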


PageRank: worked example I

Adapted from http://www.iprcom.com/papers/pagerank/, (not) retrieved 20091122

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1 − d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one. PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.


PageRank: worked example II

1 PR(Ti): each page has a notion of its own importance

2 C(Ti): each page distributes its vote equally between its outgoing links. C(Ti) is the outgoing link count for Ti

3 PR(Ti)/C(Ti) measures the share of the vote A receives from Ti

4 d(···): all votes are summed and damped by 0.85

5 (1 − d) makes the sum of all PageRanks equal to 1

[Diagram: a four-page link graph over A, B, C and D; from the update rules in the Perl loop, the links are A→B, A→C, B→C, C→A and D→C]


PageRank: worked example III

1  a: 0.00000 b: 0.00000 c: 0.00000 d: 0.00000
2  a: 0.15000 b: 0.21375 c: 0.39544 d: 0.15000
3  a: 0.48612 b: 0.35660 c: 0.78721 d: 0.15000

converges after 30+ iterations to

38 a: 1.49011 b: 0.78329 c: 1.57660 d: 0.15000
39 a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
40 a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000

main loop in perl:

while ($i--) {
    printf(
        "a: %.5f b: %.5f c: %.5f d: %.5f\n",
        $a, $b, $c, $d
    );
    $a = $norm + $damp * $c;
    $b = $norm + $damp * ($a/2);
    $c = $norm + $damp * ($a/2 + $b + $d);
    $d = $norm;
}


“Google bomb”

PageRank can be manipulated

Create lots of links to one certain destination

Label all of them with the same terms

Query Google for those terms and you will get the linked page

For example:

<a href="http://www.whitehouse.gov/president/gwbbio.html">Miserable Failure</a>


Random vs. Intentional Model

The original PageRank algorithm is grounded in themodel of a “random surfer”

PR(page) ≡ probability of reaching this page byfollowing links randomly

Users follow links by intention, so number of actualvisits reflects the importance of a page—the“intentional surfer”

Such information is captured by the Google toolbar

The nofollow attribute was introduced to combatspamdexing—artificial manufacture of large numbers oflinks to a page to increase its PageRank

Webmasters then use nofollow on outgoing links to increase their own PageRank

Consequence: loss of links for crawling and underminingof the random surfer model

PageRank currently a combination of random andintentional models


Summary

Information retrieval is a large field; the treatment here is particular and superficial

Careful pre-processing is required to handle the huge volumes of data now accessible, even if large amounts are replicated

The capability to search such quantities of data is a serendipitous consequence of a simple, efficient ranking algorithm combined with the emergent structure of the web

This material does not stand alone, but is closely linked with and inter-dependent on everything else, notably the capacity to distribute the indexing computation.


Resources I

“Never believe one source”

Stemming, see http://en.wikipedia.org/wiki/Stemming, retrieved 20081116

Martin Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/, retrieved 20081116

But see http://snowball.tartarus.org/algorithms/english/stemmer.html for the improved Snowball algorithm, retrieved 20081116

Chinese Web Information Retrieval Forum: http://www.cwirf.org/


Resources II

Information Retrieval (C.J. van Rijsbergen), on-line book at http://www.dcs.gla.ac.uk/Keith/Preface.html

Introduction to Information Retrieval, on-line book at http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, includes chapters in pdf.

Lucene text retrieval library in Java (and others): lucene.apache.org

The original PageRank paper at: http://infolab.stanford.edu/~backrub/google.html


References I

[Baeza-Yates and Ribeiro-Neto, 1999] was a primary source for these notes. More recent and net-accessible is [Manning et al., 2008].

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press. ISBN 0-201-39829-X.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 0521865719. Also available from http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, retrieved 20081116.