Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005

Slide 5.1

Chapter 5 : How Does a Search Engine Work

• How do we measure relevance of a search result to a query?
  – Content relevance (TF-IDF).
  – Link-based metrics.
  – PageRank.
  – HITS, hubs and authorities.

• Search engine evaluation.

Slide 5.2

Content Relevance - Vector Space Model

Slide 5.3

Term Frequency (TF)

• Count number of occurrences of each term.

• Bag of words approach.

• Ignore stopwords such as is, a, of, the, …

• Stemming - computer is replaced by comput, as are its variants: computers, computing, computation and computed.

• Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag.

[Figure: example bag of words containing chess, computer, programming, chess, game, chess, game, is, a.]
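
As a rough illustration of the steps on this slide, the following Python sketch counts term frequencies for a small bag of words. The stopword list and the `crude_stem` helper are hypothetical stand-ins (a real system would use a proper stemmer), and normalisation here divides by the count of the most frequent term, which is only one of the options listed above.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"is", "a", "of", "the"}

def crude_stem(word):
    """Very rough suffix stripping; an illustrative stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "ers", "er", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Return TF per stemmed term, normalised by the most frequent term's count."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(crude_stem(w) for w in words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: n / max_count for term, n in counts.items()}

tf = term_frequencies("chess computer programming chess game chess game is a")
print(tf)
```

Here "chess" occurs three times and so gets the maximum normalised frequency, while the stopwords "is" and "a" are dropped before counting.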

Slide 5.4

Inverse Document Frequency (IDF)

$$\mathrm{IDF}_i = \log \frac{N}{n_i}$$

• N is number of documents in the corpus.

• $n_i$ is number of docs in which word i appears.

• Log dampens the effect of IDF.

• IDF is also number of bits to represent the term.
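
A minimal sketch of the IDF computation, assuming documents are already tokenised into lists of terms. The toy corpus is hypothetical, and base 2 is chosen here only so that the result can be read as a number of bits; the slide does not fix the base of the logarithm.

```python
import math

def inverse_document_frequency(corpus):
    """Map each term to log2(N / n_i), where n_i is the number of docs containing it."""
    n_docs = len(corpus)
    doc_counts = {}
    for doc in corpus:
        for term in set(doc):          # count each term once per document
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return {term: math.log2(n_docs / n_i) for term, n_i in doc_counts.items()}

# Hypothetical toy corpus of four already-tokenised documents.
corpus = [
    ["chess", "game"],
    ["chess", "computer"],
    ["chess", "opening"],
    ["chess", "computer", "programming"],
]
idf = inverse_document_frequency(corpus)
print(idf)
```

A term that appears in every document ("chess") gets an IDF of 0, so it contributes nothing to distinguishing one document from another.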

Slide 5.5

Ranking with TF-IDF

$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i$$

$$\mathrm{score}_j = \sum_{i \in q} w_{i,j}$$

• i – refers to word (or term) i

• j – refers to document j

• q – is the query, which is a sequence of terms

• $\mathrm{score}_j$ – is the score for document j given q

• Rank results according to the scoring function.
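
The scoring function can be sketched as follows: each document's score is the sum, over the query terms, of term frequency times inverse document frequency. The three documents are hypothetical, TF is normalised by document length (one of the options mentioned earlier), and no stopword removal or stemming is applied, to keep the example short.

```python
import math
from collections import Counter

def rank(query, docs):
    """Score each document as the sum of TF * IDF over the query terms, best first."""
    n_docs = len(docs)
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    # Number of documents in which each term appears.
    doc_freq = Counter(term for words in tokenised.values() for term in set(words))
    scores = {}
    for name, words in tokenised.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / len(words)          # TF normalised by doc length
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical mini-corpus.
docs = {
    "d1": "chess is a game",
    "d2": "computer chess programming",
    "d3": "the game of football",
}
results = rank("computer chess", docs)
print(results)
```

The document matching both query terms is ranked first, and a document matching neither scores zero.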

Slide 5.6

Content Relevance

• Phrase matching.

• Synonyms.

• URL analysis.

• Date last updated.

• Spell checking.

• Home page detection.

Slide 5.7

Link Text (Anchor Text)

• Include link text for a link pointing to a web page, say P, as part of the content of P.

• Link text is very useful in finding home pages.

• Link text behaves like user queries:
  – They act as short summaries.
  – They often match query terms.

Slide 5.8

HTML Weighting

Class  Name        HTML tags
1)     Plain Text  None of the above
2)     Strong      STRONG, B, EM, I, U
3)     List        DL, OL, UL
4)     Header      H1, H2, H3, H4, H5, H6
5)     Anchor      A
6)     Title       TITLE

• Normal retrieval = (1,1,1,1,0,1), one weight per class 1 to 6, i.e. ranking with TF-IDF.

• (1,8,1,8,8,2) – 39.6% improvement.

• (1,8,1,7,8,2) – 48.3% improvement – C2, C4 and C5 carry the most weight.

• (1,8,1,5,8,2) – 43.5% improvement.

• Meta tag text is mostly ignored by search engines
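
One way to read the six-digit strings on this slide is as one weight per HTML class, in the order of the table. Under that interpretation (an assumption, as is the weighted-sum form used here), a term's weighted frequency in a page could be sketched as below; the per-class counts are hypothetical.

```python
# Class order follows the table: Plain Text, Strong, List, Header, Anchor, Title.
CLASSES = ("plain", "strong", "list", "header", "anchor", "title")

def weighted_tf(class_counts, class_weights):
    """Weighted sum of a term's per-class occurrence counts (assumed combination rule)."""
    return sum(class_weights[i] * class_counts.get(name, 0)
               for i, name in enumerate(CLASSES))

# Hypothetical counts for one term in one page.
counts = {"plain": 3, "header": 1, "anchor": 2}

normal = weighted_tf(counts, (1, 1, 1, 1, 0, 1))    # anchor text ignored
boosted = weighted_tf(counts, (1, 8, 1, 7, 8, 2))   # header and anchor text emphasised
print(normal, boosted)
```

With the first weighting, anchor-text occurrences do not count at all; with the second, a single header or anchor occurrence outweighs several plain-text ones.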

Slide 5.9

Link-Based Metrics

• A link from A to B can be viewed as a recommendation, a vote or a citation.

• Links can be
  – referential, or
  – informational.

• Links affect the ranking of web pages and thus have commercial value.

Slide 5.10

Web site to explain PageRank

[Figure: example web site graph with pages a1, b1, b2, b3, b4, c1, d1, d2, e1 and e2.]

Slide 5.11

PageRank - Motivation

• The number of incoming links to a page is a measure of the importance and authority of the page.

• Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.

Slide 5.12

The Random Surfer

• Assume the web is a Markov chain.

• Surfers randomly click on links, where the probability of following any particular outlink from page A is 1/m, where m is the number of outlinks from A.

• The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.

• Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
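
The random surfer can be simulated directly. The sketch below is a Monte Carlo estimate rather than an exact computation: it follows one surfer for a fixed number of steps and reports how often each page was visited. The three-page graph, the default teleportation probability of 0.15 and the fixed random seed are assumptions made for the example.

```python
import random

def simulate_surfer(links, teleport=0.15, steps=100_000, seed=0):
    """Estimate visit probabilities by simulating a single random surfer."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        if not outlinks or rng.random() < teleport:
            current = rng.choice(pages)        # bored (or stuck): jump to any page
        else:
            current = rng.choice(outlinks)     # each outlink has probability 1/m
    return {page: count / steps for page, count in visits.items()}

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = simulate_surfer(links)
print(freq)
```

Page B, which receives only half of A's "vote", ends up with the lowest visit frequency.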

Slide 5.13

Dangling Pages

• Problem: A and B have no outlinks.

• Solution: Assume A and B have links to all web pages with equal probability.

[Figure: example graph with pages A, C and B.]
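
The proposed fix for dangling pages can be expressed as a small graph transformation. In this sketch the link structure (C linking to A and B) is a hypothetical guess, since the figure itself is not reproduced here.

```python
def patch_dangling(links):
    """Return a copy of the graph in which pages with no outlinks link to every page."""
    pages = sorted(links)
    return {page: list(outs) if outs else list(pages)
            for page, outs in links.items()}

# Hypothetical graph: C links to A and B, which have no outlinks themselves.
links = {"A": [], "B": [], "C": ["A", "B"]}
patched = patch_dangling(links)
print(patched)
```

After patching, a surfer arriving at A or B moves to any page with probability 1/N, which has the same effect as a forced teleport.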

Slide 5.14

Rank Sink

• Problem: Pages in a loop accumulate rank but do not distribute it.

• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

Slide 5.15

PageRank (PR) - Definition

• W is a web page

• W1, ..., Wn are the web pages that have a link to W

• O(Wi) is the number of outlinks from Wi

• T is the teleportation probability

• N is the size of the web

$$PR(W) = \frac{T}{N} + (1 - T)\left(\frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \cdots + \frac{PR(W_n)}{O(W_n)}\right)$$
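
A minimal sketch of this definition as an iterative computation follows. The graph representation (a dict of outlink lists), the fixed iteration count and the treatment of dangling pages as linking to every page are assumptions made for the example, not details given on the slide.

```python
def pagerank(links, teleport=0.15, iterations=100):
    """Iterate PR(W) = T/N + (1 - T) * sum of PR(Wi)/O(Wi) over pages Wi linking to W."""
    pages = sorted(links)
    n = len(pages)
    pr = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        new_pr = dict.fromkeys(pages, teleport / n)
        for source, outlinks in links.items():
            # Assumption: dangling pages are treated as linking to every page.
            targets = outlinks or pages
            share = (1 - teleport) * pr[source] / len(targets)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print(pr)
```

With the T/N term the values form a probability distribution, so they sum to 1 over all pages.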

Slide 5.16

Example web site

Slide 5.17

Iteratively Computing PageRank

• Replace T/N in the def. of PR(W) by T, so that the PR values sum to N (averaging 1) rather than to 1.

• T is normally set to 0.15, but for simplicity let's set it to 0.5.

• Set initial PR values to 1.

• Solve the following equations iteratively:

$$PR(A) = 0.5 + 0.5\,PR(C)$$

$$PR(B) = 0.5 + 0.5\,(PR(A)/2)$$

$$PR(C) = 0.5 + 0.5\,(PR(A)/2 + PR(B))$$

Slide 5.18

Example Computation of PR

Iteration  PR(A)       PR(B)       PR(C)
0          1           1           1
1          1           0.75        1.125
2          1.0625      0.765625    1.1484375
3          1.07421875  0.76855469  1.15283203
4          1.07641602  0.76910400  1.15365601
5          1.07682800  0.76920700  1.15381050
…          …           …           …
12         1.07692308  0.76923077  1.15384615
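
The table can be reproduced with a few lines of Python. The tabulated values appear to correspond to in-place updates, where each newly computed value is used immediately in the next equation of the same iteration; the sketch below assumes that update order. The function name is hypothetical.

```python
def iterate_example(iterations, t=0.5):
    """Apply the three update equations in place, starting from PR = 1 for every page."""
    a = b = c = 1.0
    for _ in range(iterations):
        a = t + (1 - t) * c
        b = t + (1 - t) * (a / 2)
        c = t + (1 - t) * (a / 2 + b)
    return a, b, c

for k in (1, 2, 12):
    print(k, iterate_example(k))
```

The values converge to PR(A) = 14/13, PR(B) = 10/13 and PR(C) = 15/13, which is the exact solution of the three equations and sums to N = 3.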

Slide 5.19

The Largest Matrix Computation in the World

• Computing PageRank can be done via matrix multiplication, where the matrix has over 8 billion rows and columns.

• The matrix is sparse as average number of outlinks is between 7 and 8.

• Setting T = 0.15 or above requires about 100 iterations to converge.

• Researchers are still trying to speed up the computation.

Slide 5.20

Factor in Link Metrics to Relevance of Page

$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i \cdot PR_j$$

• Multiply by the PageRank of the document (web page) j, where i is the term.

• We do not know exactly how Google factors in the PR; it may be that log(PR) is used.

Slide 5.21

HITS – Hubs and Authorities - Hyperlink-Induced Topic Search

• A on the left is an authority (a page that many hubs link to).

• A on the right is a hub (a page that links to many authorities).

Slide 5.22

Pre-processing for HITS

1) Collect the top t pages (say t = 200) based on the input query; call this the root set.

2) Extend the root set into a base set as follows, for all pages p in the root set:

   1) add to the base set all pages that p points to, and

   2) add to the base set up to q pages that point to p (say q = 50).

3) Delete all links within the same web site in the base set, resulting in a focused sub-graph.
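
The pre-processing steps can be sketched as follows, assuming the root set has already been retrieved and that outlinks and inlinks are available as dictionaries keyed by URL. Treating "same web site" as "same host name" is an assumption made for the example, as are the function name and the example URLs.

```python
from urllib.parse import urlparse

def build_base_set(root_set, outlinks, inlinks, q=50):
    """Expand a root set into a base set and return (base_set, inter-site links)."""
    base = set(root_set)
    for page in root_set:
        base.update(outlinks.get(page, []))        # all pages that `page` points to
        base.update(inlinks.get(page, [])[:q])     # up to q pages pointing to `page`
    edges = set()
    for source in base:
        for target in outlinks.get(source, []):
            # Assumption: two URLs belong to the same site if their hosts match.
            same_site = urlparse(source).netloc == urlparse(target).netloc
            if target in base and not same_site:
                edges.add((source, target))
    return base, edges

# Hypothetical link data.
outlinks = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/"],
    "http://c.example/": ["http://a.example/1"],
}
inlinks = {"http://a.example/1": ["http://c.example/"]}
base, edges = build_base_set(["http://a.example/1"], outlinks, inlinks)
print(sorted(base))
print(sorted(edges))
```

The link from `http://a.example/1` to `http://a.example/2` is dropped because both pages are on the same site, while the two inter-site links are kept.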

Slide 5.23

Expanding the Root Set

Slide 5.24

HITS Algorithm – Iterate until Convergence

$$A(p) = \sum_{q \in B \,\mid\, q \to p} H(q)$$

$$H(p) = \sum_{q \in B \,\mid\, p \to q} A(q)$$

• B is the base set

• q and p are web pages in B

• A(p) is the authority score for p

• H(p) is the hub score for p
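
A minimal sketch of the iteration follows, taking the links of the focused sub-graph as a set of (source, target) pairs. Rescaling the scores to unit length after each round is a common practice assumed here so that the values stay bounded; the fixed iteration count and the example graph are also assumptions.

```python
import math

def hits(edges, iterations=50):
    """Iterate the authority and hub updates over a set of (source, target) links."""
    pages = sorted({p for edge in edges for p in edge})
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A(p) = sum of H(q) over pages q that link to p.
        auth = {p: sum(hub[q] for q, t in edges if t == p) for p in pages}
        # H(p) = sum of A(q) over pages q that p links to.
        hub = {p: sum(auth[t] for s, t in edges if s == p) for p in pages}
        # Assumption: rescale to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical sub-graph: three hubs pointing at two candidate authorities.
edges = {("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")}
auth, hub = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Page a1, pointed to by all three hubs, gets the top authority score, and h1, which points to both authorities, gets the top hub score.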

Slide 5.25

Applications of HITS

• Search engine querying (speed is an issue).

• Finding web communities.

• Finding related pages.

• Populating categories in web directories.

• Citation analysis.

Slide 5.26

Communities on the Web

• A densely linked focused sub-graph of hubs and authorities is called a community.

• Over 100,000 emerging web communities have been discovered from a web crawl (a process called trawling).

• Alternatively, a community is a set of web pages W in which each page has at least as many links to pages in W as to pages outside W.
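
The alternative definition lends itself to a direct check. The sketch below assumes the condition is applied to each page of the candidate set separately and counts outgoing links only; both choices are interpretations of the wording above, and the link data is hypothetical.

```python
def is_community(pages, outlinks):
    """True if every page in `pages` has at least as many links into the set as out of it."""
    members = set(pages)
    for page in members:
        targets = outlinks.get(page, [])
        inside = sum(1 for t in targets if t in members)
        if inside < len(targets) - inside:
            return False
    return True

# Hypothetical link data.
outlinks = {"a": ["b", "c"], "b": ["a", "x"], "c": ["a", "b"], "x": ["y", "z"]}
print(is_community({"a", "b", "c"}, outlinks))
print(is_community({"a", "x"}, outlinks))
```

The set {a, b, c} passes because no member links outside the set more often than inside it, whereas {a, x} fails because all of a's links leave the set.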

Slide 5.27

Weblogs' Influence on PageRank

• A weblog (or blog) is a frequently updated web site on a particular topic, made up of entries in reverse chronological order.

• Blogs are a rich source of links, and therefore their links influence PageRank.

• A “google bomb” is an attempt to influence the ranking of a web page for a given phrase by adding links to the page with the phrase as its anchor text.

Slide 5.28

Link Spamming to Improve PageRank

• Spam is the act of trying unfairly to gain a high ranking on a search engine for a web page without improving the user experience.

• Link farms - join the farm by copying a hub page which links to all members.

• Selling links from sites with high PageRank.

Slide 5.29

Popularity Based Metrics

• Factor in users’ opinions as represented in the query logs.

• Document space modification adjusts the weights of keywords in popular pages.

• Clickthrough data can also be taken into account to improve the ranking of search engine query results.

Slide 5.30

Evaluating Search Engines

• Precision – top-n precision most important, say for n = 10 (i.e. a page of query results).

• Recall – related to search engine coverage.

• Mean reciprocal rank for Q&A systems.

• Evaluation can be carried out on test collections, e.g. TREC.
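
The three measures can be sketched in a few lines, assuming each query comes with a ranked result list and a set of documents judged relevant (or, for Q&A, a set of correct answers). The function names and the example judgements are hypothetical.

```python
def precision_at(ranked, relevant, n=10):
    """Fraction of the top-n results that are relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first correct answer per query (0 if none is found)."""
    total = 0.0
    for ranked, correct in runs:
        for position, doc in enumerate(ranked, start=1):
            if doc in correct:
                total += 1.0 / position
                break
    return total / len(runs)

# Hypothetical results and relevance judgements.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
p4 = precision_at(ranked, relevant, n=4)
r = recall(ranked, relevant)
mrr = mean_reciprocal_rank([(ranked, relevant), (["d5", "d9"], {"d5"})])
print(p4, r, mrr)
```

Two of the four returned documents are relevant, two of the three relevant documents were found, and the first correct answers appear at ranks 2 and 1 for the two queries.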

Slide 5.31

Typical Recall-Precision Curve

• Top-n precision – proportion of relevant pages among the top n ranked results.

• Measure top-n precision at fixed recall points, with n ranging from 0% to 100% of the ranked results.
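
The points of such a curve can be computed by walking down the ranked list and recording recall and precision after each position. This is a sketch under the assumption that relevance judgements are available as a set; the ranking and judgements shown are hypothetical, and no interpolation to fixed recall levels is performed.

```python
def recall_precision_points(ranked, relevant):
    """Return (recall, precision) after each rank position n = 1, 2, ..."""
    points = []
    found = 0
    for n, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
        points.append((found / len(relevant), found / n))
    return points

# Hypothetical ranking and relevance judgements.
ranked = ["d1", "d4", "d2", "d6", "d3"]
relevant = {"d1", "d2", "d3"}
points = recall_precision_points(ranked, relevant)
for recall_value, precision_value in points:
    print(round(recall_value, 2), round(precision_value, 2))
```

Recall can only grow as n increases, while precision falls whenever a non-relevant result is encountered.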