

CS 430: Information Discovery

Lecture 5

Ranking


Course Administration

• The optional course readings are just that: optional. Read them if you wish. Some may require a visit to a library!

• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.

• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.


Course Administration

Hints on Assignment 1

• You are not building a production system!!!
• The volume of test data is quite small.

Therefore

Choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., do not implement B-trees).

Consider batch loading of data (e.g., no need to provide for incremental update).

User interface can be minimal (e.g., single letter commands).

To save typing, we will provide the arrays char_class and convert_class from Frakes, Chapter 7.


Term Frequency

Concept

A term that appears many times within a document is likely to be more important than a term that appears only once.


Term Frequency

Suppose term j appears fij times in document i

A simple method (as illustrated in Lecture 4) is to use fij as the term frequency.

Standard method

Scale fij relative to the other terms in the document. This partially corrects for variations in the length of the documents.

Let mi = maxj (fij), i.e., mi is the maximum frequency of any term in document i.

Term frequency (tf):

tfij = fij / mi
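As a minimal sketch of this calculation in Python (the tokenized document and the function name are illustrative, not part of the assignment):

```python
from collections import Counter

def term_frequencies(tokens):
    """tf_ij = f_ij / m_i, where f_ij is the raw count of term j in
    document i and m_i is the largest count of any term in the document."""
    counts = Counter(tokens)             # f_ij for each term j
    m = max(counts.values())             # m_i
    return {term: f / m for term, f in counts.items()}

# The most frequent term always receives tf = 1.0.
print(term_frequencies(["cat", "cat", "dog"]))   # {'cat': 1.0, 'dog': 0.5}
```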


Inverse Document Frequency

Concept

A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.


Inverse Document Frequency

Suppose there are n documents and that the number of documents in which term j occurs is dj.

A simple method is to use n/dj as the inverse document frequency.

Standard method

The simple method over-emphasizes small differences. Therefore use a logarithm. Inverse document frequency (idf):

idfj = log2 (n/dj) + 1,   dj > 0


Example of Inverse Document Frequency

Example: n = 1,000 documents

term j      dj        idfj
A           100       4.32
B           500       2.00
C           900       1.13
D           1,000     1.00

From: Salton and McGill
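A sketch of the same calculation (the function name is mine; note that the formula yields 1.15 for term C, slightly different from the table's rounded value):

```python
import math

def idf(n, d_j):
    """idf_j = log2(n / d_j) + 1, defined for d_j > 0."""
    return math.log2(n / d_j) + 1

n = 1000
for term, d_j in [("A", 100), ("B", 500), ("C", 900), ("D", 1000)]:
    print(term, d_j, round(idf(n, d_j), 2))
# A term that occurs in every document (D) gets idf = 1.0;
# the rarest term (A) scores highest.
```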


Standard Version of tf.idf Weighting

Combining tf and idf:

(a) Weight is proportional to the number of times that the term appears in the document.

(b) Weight is proportional to the logarithm of the total number of documents divided by the number of documents that contain the term.

Notation

wij is the weight given to term j in document i
fij is the frequency with which term j appears in document i
dj is the number of documents that contain term j
mi is the maximum frequency of any term in document i
n is the total number of documents


Standard Form of tf.idf

Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:

(Weight of term j in document i)

= (Term frequency) * (Inverse document frequency)

The standard tf.idf weighting scheme is:

wij = tfij * idfj

= (fij / mi) * (log2 (n/dj) + 1)

Frakes, Chapter 14 discusses many variations on this basic scheme.
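As a minimal sketch of the standard scheme over a toy corpus (the corpus and names below are illustrative only):

```python
import math
from collections import Counter

def tfidf(docs):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1) for every document i, term j."""
    n = len(docs)
    d = Counter(term for doc in docs for term in set(doc))   # d_j
    weights = []
    for doc in docs:
        counts = Counter(doc)                                # f_ij
        m = max(counts.values())                             # m_i
        weights.append({t: (f / m) * (math.log2(n / d[t]) + 1)
                        for t, f in counts.items()})
    return weights

docs = [["web", "search", "ranking"],
        ["web", "web", "pages"],
        ["ranking", "algorithms"]]
for w in tfidf(docs):
    print({t: round(x, 2) for t, x in w.items()})
```

Note how a term appearing in only one of the three documents (e.g., "search") receives a higher weight than one shared across documents (e.g., "web").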


Ranking Based on Reference Patterns

With term weighting (e.g., tf.idf), documents are ranked by how well they match a specific query.

With ranking by reference patterns, documents are ranked based on the references among them. The ranking of a set of documents is independent of any specific query.

In journal literature, references are called citations.

On the web, references are called links or hyperlinks.


Citation Graph

[Figure: a citation graph. Arrows run from each paper to the earlier papers it cites, and to it from the later papers that cite it.]

Note that journal citations always refer to earlier work.


Bibliometrics

Techniques that use citation analysis to measure the similarity or importance of journal articles.

Bibliographic coupling: two papers that cite many of the same papers

Co-citation: two papers that were cited by many of the same papers

Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period


Graphical Analysis of Hyperlinks on the Web

[Figure: a graph of six web pages, numbered 1 to 6, joined by hyperlink arrows. One page links to many other pages; many pages link to another page.]


Matrix Representation

                     Citing page (from)

               P1    P2    P3    P4    P5    P6   Number
        P1                              1            1
        P2      1           1                        2
Cited   P3      1     1           1                  3
page    P4      1     1                 1     1      4
(to)    P5      1                                    1
        P6                              1            1

      Number    4     2     1     1     3     1


PageRank Algorithm (Google)

Concept:

The rank of a web page is higher if many pages link to it.

Links from highly ranked pages are given greater weight than links from less highly ranked pages.


Intuitive Model

A user:

1. Starts at a random page on the web

2. Selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Step 2 a very large number of times

Pages are ranked according to the relative frequency with which they are visited.


Basic Algorithm: Normalize by Number of Links from Page

                     Citing page (from)

               P1    P2    P3    P4    P5    P6
        P1                             0.33
        P2    0.25          1
Cited   P3    0.25   0.5          1
page    P4    0.25   0.5               0.33   1
(to)    P5    0.25
        P6                             0.33

      Number    4     2     1     1     3     1

B = the normalized link matrix: each non-zero entry is 1 divided by the number of links from the citing page (the column total in the citation matrix), so every column sums to 1.
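A sketch of this normalization step, assuming numpy; the citation matrix is the one from the Matrix Representation slide (rows = cited page, columns = citing page):

```python
import numpy as np

# Citation matrix: A[i, j] = 1 if page j cites page i.
A = np.array([[0, 0, 0, 0, 1, 0],    # P1 is cited by P5
              [1, 0, 1, 0, 0, 0],    # P2 is cited by P1, P3
              [1, 1, 0, 1, 0, 0],    # P3 is cited by P1, P2, P4
              [1, 1, 0, 0, 1, 1],    # P4 is cited by P1, P2, P5, P6
              [1, 0, 0, 0, 0, 0],    # P5 is cited by P1
              [0, 0, 0, 0, 1, 0]])   # P6 is cited by P5

# Divide each column by the number of links from that citing page,
# so every column of B sums to 1.
B = A / A.sum(axis=0)
print(np.round(B, 2))
```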


Basic Algorithm: Weighting of Pages

Initially all pages have weight 1:

w1 = (1, 1, 1, 1, 1, 1)

Recalculate the weights:

w2 = Bw1 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)

(Each vector lists the weights of pages P1 through P6.)


Basic Algorithm: Iterate

Iterate: wk = Bwk-1

w1 = (1, 1, 1, 1, 1, 1)
w2 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
w3 = (0.08, 1.83, 2.79, 1.12, 0.08, 0.08)
w4 = (0.03, 2.80, 2.06, 1.05, 0.02, 0.03)
...

w1, w2, w3, w4, ... converges to w = (0.00, 2.39, 2.39, 1.19, 0.00, 0.00)
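A sketch of the iteration, continuing with B and numpy from the earlier sketch:

```python
# Power iteration: w_k = B w_{k-1}, starting with weight 1 on every page.
# B is column-stochastic, so the total weight stays fixed at 6.
w = np.ones(6)
for _ in range(100):
    w = B @ w
print(np.round(w, 2))   # [0. 2.4 2.4 1.2 0. 0.], matching the slide's limit up to rounding
```

Note that pages P1, P5, and P6 end with weight 0: they lie outside the cycle of pages that keep citing each other, so their weight drains away over the iterations.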


Google PageRank with Damping

A user:

1. Starts at a random page on the web

2a. With probability p, selects any random page and jumps to it

2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Steps 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.


The PageRank Iteration

The basic method iterates using the normalized link matrix, B.

wk = Bwk-1

The limit w is the principal eigenvector of B (the eigenvector corresponding to the largest eigenvalue).

Google instead iterates with a damping factor. The method iterates using a matrix B', where:

B' = pN + (1 - p)B

N is the matrix with every element equal to 1/n. p is a constant found by experiment.


Google: PageRank

The Google PageRank algorithm is usually written with the following notation.

If page A has pages T1, ..., Tn pointing to it:
– d: damping factor
– C(A): number of links out of A

Iterate until convergence:

P(A) = (1 - d) + d ( P(T1)/C(T1) + ... + P(Tn)/C(Tn) )
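A sketch of the damped iteration in the matrix form B' = pN + (1 - p)B from the previous slide, again continuing with B from the earlier sketch; p = 0.15 is only an assumed illustration, since the slides say p is found by experiment:

```python
n = 6
p = 0.15                        # assumed damping constant
N = np.full((n, n), 1.0 / n)    # matrix with every element equal to 1/n
B_damped = p * N + (1 - p) * B

w = np.ones(n)
for _ in range(100):
    w = B_damped @ w
print(np.round(w, 2))   # every page now gets some weight; heavily cited pages still rank highest
```

The random jumps make the chain ergodic, so no page's weight drains to exactly zero, unlike the undamped iteration above.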


Information Retrieval Using PageRank

Simple Method

Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal.

Display the hits ranked by PageRank.

The disadvantage of this method is that it pays no attention to how closely a document matches the query.


Reference Pattern Ranking using Dynamic Document Sets

PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.

Concept. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections.

With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.


Reference Pattern Ranking using Dynamic Document Sets

Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)

1. Search using conventional term weighting. Rank the hits using similarity between query and documents.

2. Select the highest ranking hits (e.g., top 5,000 hits).

3. Carry out PageRank or a similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.

4. Display the results ranked by these query-specific document ranks.


Combining Term Weighting with Reference Pattern Ranking

Combined Method

1. Find all documents that share a term with the query vector.

2. The similarity, using conventional term weighting, between the query and document j is sj.

3. The rank of document j using PageRank or other reference pattern ranking is pj.

4. Calculate a combined rank cj = λsj + (1 - λ)pj, where λ is a constant (see the sketch after this list).

5. Display the hits ranked by cj.

This method is used in several commercial systems, but the details have not been published.
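A minimal sketch of the combination step, assuming the sj and pj values have already been normalized to comparable ranges, and taking λ = 0.7 purely for illustration:

```python
def combined_rank(s, p, lam=0.7):
    """c_j = lam * s_j + (1 - lam) * p_j for every hit j."""
    return {doc: lam * s[doc] + (1 - lam) * p[doc] for doc in s}

similarity = {"doc1": 0.9, "doc2": 0.4}   # s_j from term weighting
pagerank   = {"doc1": 0.2, "doc2": 0.8}   # p_j from reference-pattern ranking
for doc, c in sorted(combined_rank(similarity, pagerank).items(),
                     key=lambda kv: -kv[1]):
    print(doc, round(c, 2))   # doc1 0.69, doc2 0.52
```

The choice of λ sets the trade-off: λ near 1 ranks almost entirely by query match, λ near 0 almost entirely by reference patterns.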


Cornell Note

Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, covering both theory and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).