Weighting and Matching against Indices


Zipf’s Law

• In any corpus, such as the AIT, we can count how often each word occurs in the corpus as a whole: this is its word frequency, F(w).

• Now imagine that we’ve sorted the vocabulary according to frequency, so that the most frequently occurring word will have rank = 1, the next most frequent word will have rank = 2, and so on.

• Zipf (1949) found the following empirical relation:

• F(w) = C / rank(w)^α, where α ≈ 1 and C ≈ 20,000

• If α = 1, rank * frequency is approx. constant.
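
As a quick sanity check, here is a minimal Python sketch that ranks words by frequency and prints rank * frequency, which Zipf's Law predicts to be roughly constant. The token list is a toy stand-in for a real corpus such as the AIT:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank words by descending frequency; Zipf's Law predicts
    rank * frequency to be roughly constant (equal to C when alpha = 1)."""
    ranked = Counter(tokens).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy corpus; on a real corpus the last column should hover near C (~20,000).
tokens = "the cat sat on the mat and the dog sat on the log".split()
for rank, word, freq, product in zipf_table(tokens):
    print(rank, word, freq, product)
```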


Consequences of lexical decisions on word frequencies

• Noise words occur frequently

• “External” keywords are also frequent (these tell you what the corpus is about, but do not help index individual documents).

• Zipf’s Law is seen both with and without stemming (see the table below).

Token      Frequency (stemmed)   Frequency (unstemmed)
the        78,428
of         50,026
and        33,834
a          31,347
to         28,666
in         21,512
SYSTEM     21,488                8,632
is         18,781
MODEL      14,772                4,796
for        14,640
NETWORK    10,306                3,965
this       10,095
BASE       9,838
that       9,820


Other applications of Zipf’s Law

• Number of unique visitors vs. rank of website.

• Number of speakers of each language

• Prize money won by golfers

• Frequency of DNA codons

• Size of avalanches of grains of sand

• Frequency of English surnames


Resolving Power (1)

• Luhn (1957): “It is hereby proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance”.

• If a word is found in a document more frequently than we would expect, this reflects emphasis on the part of the author: the document is, to some degree, about that word.

• But raw frequency of occurrence in a document is only one of the two critical statistics that identify good keywords.

• For example, almost every article in the AIT contains the words ARTIFICIAL INTELLIGENCE, so those words cannot distinguish one article from another.


Resolving Power (2)

• Thus we prefer keywords which discriminate between documents (i.e. found only in some documents).

• Resolving power is the ability to discriminate content.

• Mid-frequency terms have the greatest resolving power.

• Luhn did not provide a method of establishing the maximal and minimal occurrence thresholds.

• Simple methods exist: the frequency of stop-list words gives an upper limit, and words which appear only once can only index one document (see the sketch below).
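
A minimal sketch of such a simple method: keep only terms whose document frequency falls between a lower cut-off and a stop-word-like upper cut-off. The threshold values min_df and max_df_ratio are illustrative choices, not Luhn's:

```python
from collections import Counter

def mid_frequency_terms(docs, min_df=2, max_df_ratio=0.5):
    """Keep terms whose document frequency lies between two cut-offs:
    below min_df a term can index too few documents; above
    max_df_ratio of all documents it behaves like a stop word."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term for term, d in df.items()
            if d >= min_df and d / n_docs <= max_df_ratio}

docs = ["the system model", "the network model",
        "the base system", "a network simulation"]
print(mid_frequency_terms(docs))   # {'system', 'model', 'network'}
```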


Exhaustivity and Specificity

• An index is exhaustive if it includes many topics.

• An index is specific if users can precisely identify their information needs.

• Trade-off: high recall is easiest when an index is exhaustive but not very specific; high precision is best accomplished when the index is highly specific but not very exhaustive; the best index will strive for a balance.

• If a document is indexed with many keywords, it will be retrieved more often (“representation bias”) – we can expect higher recall, but precision will suffer.

• We can also analyse the problem from a query-oriented perspective – how well do the query terms discriminate one document from another?


Weighting the Index Relation

• The simplest notion of an index is binary – either a keyword is associated with a document or it is not – but it is natural to imagine degrees of aboutness.

• We will use a single real number, a weight, capturing the strength of association between keyword and document.

• The retrieval method can exploit these weights directly.


Weighting (2)

• One way to describe what this weight means is probabilistic: we seek a measure of a document’s relevance, conditioned on the belief that a keyword is relevant:

• Wkd is proportional to Pr(d relevant | k relevant).

• This is a directed relation: we may or may not believe that the symmetric relation, Wdk proportional to Pr(k relevant | d relevant), is the same.

• Unless otherwise specified, when we speak of a weight w we mean Wkd.


Weighting (3)

• In order to compute statistical estimates for such probabilities we define several important quantities:

• Fkd = number of occurrences of keyword k in document d

• Fk = total number of occurrences of keyword k across the entire corpus

• Dk = number of documents containing keyword k
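
These quantities are straightforward to compute. A minimal Python sketch over a tokenised toy corpus (the helper name corpus_statistics is ours, not from the slides):

```python
from collections import Counter

def corpus_statistics(docs):
    """Compute the three quantities above for a tokenised corpus:
    F_kd - occurrences of keyword k in document d,
    F_k  - total occurrences of k across the corpus,
    D_k  - number of documents containing k."""
    F_kd = [Counter(doc) for doc in docs]            # one Counter per document
    F_k = Counter(tok for doc in docs for tok in doc)
    D_k = Counter(tok for doc in docs for tok in set(doc))
    return F_kd, F_k, D_k

docs = [["system", "model", "system"], ["network", "model"]]
F_kd, F_k, D_k = corpus_statistics(docs)
print(F_kd[0]["system"], F_k["model"], D_k["model"])   # 2 2 2
```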


Weighting (4)

• We will make two demands on the weight reflecting the degree to which a document is about a particular keyword or topic.

• 1. Repetition is an indicator of emphasis. If an author uses a word frequently, it is because she or he thinks it’s important. (Fkd)

• 2. A keyword must be a useful discriminator within the context of the corpus. Capturing this notion statistically is more difficult – for now we just give it the name discrim_k.

• Because we care about both, our weight will depend on the two factors:

• Wkd ∝ Fkd * discrim_k

• Various index weighting schemes exist: they all use Fkd, but differ in how they quantify discrim_k.


Inverse document frequency (IDF)

• Karen Sparck Jones observed that, from a discrimination point of view, we need to know the number of documents which contain a particular word.

• The value of a keyword varies inversely with the log of the number of documents in which it occurs:

• Wkd = Fkd * [ log( NDoc / Dk ) + 1 ]

• where NDoc is the total number of documents in the corpus.

• Variations on this formula exist.
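
A direct transcription of this formula into Python (using the natural logarithm; the slide does not specify a base, and real systems vary):

```python
import math

def tfidf_weight(F_kd, NDoc, D_k):
    """W_kd = F_kd * (log(NDoc / D_k) + 1): a term found in every
    document gets weight F_kd * 1; rarer terms are boosted."""
    return F_kd * (math.log(NDoc / D_k) + 1)

# A term occurring 3 times in d and appearing in 10 of 1,000 documents:
print(tfidf_weight(3, 1000, 10))   # 3 * (ln(100) + 1) ≈ 16.8
```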


Vector Space Model (1)

• In a library, closely related books are physically close together in three-dimensional space.

• Search engines consider the abstract notion of semantic space, where documents about the same topic remain close together.

• We will consider abstract spaces of thousands of dimensions.

• We start with the index matrix relating each document in the corpus to all of its keywords.

• Each and every keyword of the vocabulary is a separate dimension of a vector space. The dimensionality of the vector space is the size of our vocabulary.


Vector Space Model (2)

• In addition to the vectors representing the documents, another vector corresponds to a query.

• Because documents and queries exist within a common vector space, we seek those documents that are close to our query vector.

• A simple (unnormalised) measure of proximity is the inner (or “dot”) product of query and document vectors:

• Sim( q, d ) = q · d

• e.g. [ 1 2 3 ] · [ 10 20 30 ] = 10 + 40 + 90 = 140
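
A minimal sketch of both steps: laying weights out along a fixed vocabulary ordering, one dimension per term, and taking the dot product. The vocabulary and weights are toy values:

```python
def to_vector(weights, vocabulary):
    """Lay a {term: weight} mapping out along a fixed vocabulary
    ordering, one dimension per vocabulary term."""
    return [weights.get(term, 0.0) for term in vocabulary]

def dot(q, d):
    """Unnormalised similarity: inner product of equal-length vectors."""
    return sum(qi * di for qi, di in zip(q, d))

vocabulary = ["model", "network", "system"]
doc = to_vector({"system": 2.0, "model": 1.0}, vocabulary)   # [1.0, 0.0, 2.0]
query = to_vector({"system": 1.0}, vocabulary)               # [0.0, 0.0, 1.0]
print(dot(query, doc))                # 2.0
print(dot([1, 2, 3], [10, 20, 30]))   # the slide's example: 140
```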


Vector Length Normalisation

• Weights should be made sensitive to document length.

• Using the dot product alone, longer documents, containing more words (more verbose), are more likely to match the query than shorter ones, even if the “scope” (amount of actual information covered) is the same.

• One solution is to use the cosine measure of similarity, which divides the dot product by the lengths of the two vectors (see the sketch below).
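
A minimal sketch of the cosine measure, reusing the vectors from the dot-product example:

```python
import math

def cosine(q, d):
    """Dot product divided by both vector lengths, so a long (verbose)
    document no longer wins simply by containing more words."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return dot / (norm_q * norm_d)

# The vectors from the dot-product example point in the same direction,
# so their cosine similarity is 1.0 despite very different lengths.
print(cosine([1, 2, 3], [10, 20, 30]))   # 1.0
```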


Summary

• Zipf’s law: frequency * rank ~ constant

• Resolving power of keywords: TF * IDF

• Exhaustivity vs. specificity

• Vector space model

• Cosine Similarity measure