
Indexing Implementation and Indexing Models

CSC 575

Intelligent Information Retrieval

[Retrieval system overview diagram: document collections are parsed and pre-processed (lexical analysis, stop-word removal) to build the index; the user's information need is expressed as a text query, which is matched against the index and ranked to produce result sets. The question for this lecture: how is the index constructed?]

Indexing Implementation

- Bitmaps
  - For each term, allocate a vector with 1 bit per document
  - If the term is present in document n, set the nth bit to 1, otherwise 0
  - Boolean operations are very fast
  - Space efficient for common terms, but inefficient for rare terms (why?)
  - Difficult to add/delete documents (why?)
  - Not widely used
- Signature files (also called superimposed coding)
  - For each term, allocate a fixed-size s-bit vector (signature)
  - Define a hash function mapping each word into 1..2^s; each term then has an s-bit signature (may not be unique)
  - OR the term signatures together to form the document signature
  - Lookup: compute the signature for the query term; if all of its 1-bits are on in the document signature, the document probably contains that term (see the sketch below)
- Inverted files …
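A minimal sketch of the superimposed-coding lookup described above, assuming a 64-bit signature width and a hypothetical term_signature helper that sets a few hashed bits per term; real systems tune both choices:

import hashlib

S = 64  # assumed signature width in bits

def term_signature(term, bits_per_term=3):
    # Hash the term to an s-bit signature by setting a few bits.
    sig = 0
    for i in range(bits_per_term):
        digest = hashlib.md5(f"{term}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest, "big") % S)
    return sig

def document_signature(terms):
    # OR the term signatures together to form the document signature.
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

def maybe_contains(doc_sig, term):
    # If all 1-bits of the term signature are on in the document signature,
    # the document *probably* contains the term (false positives are possible).
    t_sig = term_signature(term)
    return doc_sig & t_sig == t_sig

doc_sig = document_signature("now is the time for all good men".split())
print(maybe_contains(doc_sig, "time"), maybe_contains(doc_sig, "midnight"))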

Indexing Implementation

- Inverted files
  - Primary data structure for text indexes
  - Source file: the collection, organized by document
  - Inverted file: the collection organized by term (one record per term, listing the locations where the term occurs)
  - Query: traverse the lists for each query term
    - OR: the union of the component lists
    - AND: the intersection of the component lists
- Based on the view of documents as vectors in an n-dimensional space
  - n = number of index terms used for indexing
  - Each document is a bag of words (a vector) with a direction and a magnitude
  - The Vector-Space Model for IR

The Vector Space Model

- Vocabulary V = the set of terms left after pre-processing the text (tokenization, stop-word removal, stemming, ...)
- Each document or query is represented as a |V| = n dimensional vector:
  - d_j = [w_1j, w_2j, ..., w_nj], where w_ij is the weight of term i in document j
  - the terms in V form the orthogonal dimensions of a vector space
- Document = bag of words: the vector representation does not consider the ordering of words
  - "John is quicker than Mary" vs. "Mary is quicker than John"

Document Vectors and Indexes

- Conceptually, the index can be viewed as a document-term matrix
  - Each document is represented as an n-dimensional vector (n = number of terms in the dictionary)
  - Term weights represent the scalar value of each dimension in a document
  - The inverted file structure is an "implementation model" used in practice to store the information captured in this conceptual representation

Example (the terms form the dictionary; each row is a document vector, showing its non-zero term weights, normalized in this case):

  Terms (dictionary): nova, galaxy, heat, hollywood, film, role, diet, fur

  A: 1.0  0.5  0.3
  B: 0.5  1.0
  C: 1.0  0.8  0.7
  D: 0.9  1.0  0.5
  E: 1.0  1.0
  F: 0.9  1.0
  G: 0.5  0.7  0.9
  H: 0.6  1.0  0.3  0.2  0.8
  I: 0.7  0.5  0.1  0.3

Example: Documents and Query in 3D Space

- Documents in term space
  - Terms are usually stems
  - Documents (and the query) are represented as vectors of terms
- Query and document weights
  - based on the length and direction of their vectors
- Why use this representation?
  - A vector distance measure between the query and documents can be used to rank retrieved documents

Recall: Inverted Index Construction

- Invert documents into a big index
  - the vector file is "inverted" so that rows become columns and columns become rows
- Basic idea:
  - list all the tokens in the collection
  - for each token, list all the docs it occurs in (together with frequency info)

  docs  t1  t2  t3
  D1    1   0   1
  D2    1   0   0
  D3    0   1   1
  D4    1   0   0
  D5    1   1   1
  D6    1   1   0
  D7    0   1   0
  D8    0   1   0
  D9    0   0   1
  D10   0   1   1

Inverted (terms x docs):

  Terms  D1  D2  D3  D4  D5  D6  D7 ...
  t1     1   1   0   1   1   1   0
  t2     0   0   1   0   1   1   1
  t3     1   0   1   0   1   0   0

Sparse matrix representation: in practice this data is very sparse; we do not need to store all the 0's. Hence, the sorted array implementation …

How Are Inverted Files Created?

- Sorted array implementation
  - Documents are parsed to extract tokens, which are saved with their document IDs

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight.

Resulting (term, doc #) pairs, in parsing order:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

How Inverted Files are Created

- After all documents have been parsed, the inverted file is sorted by term (with duplicates retained for within-document frequency statistics)
- If frequency information is not needed, the inverted file can be sorted with duplicates removed

The (term, doc #) list above, after sorting:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

How Inverted Files are Created

- Multiple term entries for a single document are merged
- Within-document term frequency information is compiled
- If proximity operators are needed, then the location of each occurrence of the term must also be stored
- Terms are usually represented by unique integers, which fixes the record size and minimizes storage space

Merged (term, doc #, freq) entries:
a (2,1), aid (1,1), all (1,1), and (2,1), come (1,1), country (1,1), country (2,1), dark (2,1), for (1,1), good (1,1), in (2,1), is (1,1), it (2,1), manor (2,1), men (1,1), midnight (2,1), night (2,1), now (1,1), of (1,1), past (2,1), stormy (2,1), the (1,2), the (2,2), their (1,1), time (1,1), time (2,1), to (1,2), was (2,2)

How Inverted Files are Created

- Then the file can be split into a dictionary and a postings file

Dictionary (term, no. of docs, total freq):
a (1,1), aid (1,1), all (1,1), and (1,1), come (1,1), country (2,2), dark (1,1), for (1,1), good (1,1), in (1,1), is (1,1), it (1,1), manor (1,1), men (1,1), midnight (1,1), night (1,1), now (1,1), of (1,1), past (1,1), stormy (1,1), the (2,4), their (1,1), time (2,2), to (1,2), was (1,2)

Postings (doc #, freq), in dictionary order:
(2,1) (1,1) (1,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2) (2,2) (1,1) (1,1) (2,1) (1,2) (2,2)

Notes: the postings for a term are usually linked together (e.g., as a linked list), and the dictionary is augmented with term statistics such as document frequency and total frequency in the collection.
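As a compact illustration of this sorted-array pipeline on the two example documents above (parse into (term, doc) pairs, sort, merge duplicates into within-document frequencies, then split into dictionary and postings), here is a small sketch; the variable names are illustrative only:

from collections import Counter, defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# 1. Parse: emit (term, doc_id) pairs
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# 2. Sort by term, then by doc_id
pairs.sort()

# 3. Merge duplicates into within-document frequencies
tf = Counter(pairs)  # (term, doc_id) -> frequency

# 4. Split into postings (term -> [(doc_id, freq)]) and dictionary (term -> (df, total freq))
postings = defaultdict(list)
for (term, doc_id), freq in sorted(tf.items()):
    postings[term].append((doc_id, freq))
dictionary = {t: (len(pl), sum(f for _, f in pl)) for t, pl in postings.items()}

print(dictionary["the"], postings["the"])          # (2, 4) [(1, 2), (2, 2)]
print(dictionary["country"], postings["country"])  # (2, 2) [(1, 1), (2, 1)]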

Inverted Indexes and Queries

- Permit fast search for individual terms
- For each term, you get a hit list consisting of:
  - document ID
  - frequency of the term in the doc (optional)
  - positions of the term in the doc (optional)
- These lists can be used to quickly solve Boolean queries:
  - country ==> {d1, d2}
  - manor ==> {d2}
  - country AND manor ==> {d2}
- Full advantage of this structure can be taken by statistical ranking algorithms such as the vector space model
  - in the case of Boolean queries, term or document frequency information is not used (just set operations performed on hit lists)
- We will look at the vector model later; for now let's examine Boolean queries more closely

Scalability Issues: Number of Postings

An example:
- Number of docs = m = 1M, each doc with 1K terms
- Number of distinct terms = n = 500K
- 600 million postings entries

Bottleneck

- Parse and build postings entries one doc at a time
- Sort postings entries by term (then by doc within each term)
- Doing this with random disk seeks would be too slow: we must sort N = 600M records

If every comparison took 2 disk seeks (10 milliseconds each), and N items could be sorted with N log2 N comparisons, how long would this take?

Sorting with fewer disk seeks

- 12-byte (4+4+4) records (term, doc, freq)
- These are generated as we parse docs
- Must now sort 600M such 12-byte records by term
- Define a block of (e.g.) ~10M records
- Sort within blocks first, then merge the blocks into one long sorted order
- Blocked Sort-Based Indexing (BSBI)
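A rough sketch of the BSBI idea under simplified assumptions (sorted runs are kept in memory as Python lists rather than written to disk, and postings hold only doc IDs); the function name is illustrative:

import heapq
from itertools import groupby

def bsbi(pair_stream, block_size=10_000_000):
    # Collect (term, doc_id) pairs into fixed-size blocks and sort each block.
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) >= block_size:
            runs.append(sorted(block))  # a real system writes each sorted run to disk
            block = []
    if block:
        runs.append(sorted(block))

    # Merge the sorted runs into one long sorted order, then group into postings lists.
    index = {}
    for term, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index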

Problem with the sort-based algorithm

- Assumption: we can keep the dictionary in memory
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings ...
- ... but then intermediate files become very large (we would end up with a scalable, but very slow, index construction method)

Sec. 4.3

SPIMI: Single-pass in-memory indexing

- Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks
- Key idea 2: don't sort; accumulate postings in postings lists as they occur
- With these two ideas we can generate a complete inverted index for each block
- These separate indexes can then be merged into one big index

Sec. 4.3
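A sketch of the two SPIMI ideas (a per-block dictionary, postings accumulated as they occur rather than sorted), again with in-memory blocks standing in for on-disk block indexes:

from collections import defaultdict

def spimi_blocks(pair_stream, max_postings_per_block=1_000_000):
    # Accumulate postings in per-term lists as pairs arrive; no global termID mapping.
    block, count = defaultdict(list), 0
    for term, doc_id in pair_stream:
        block[term].append(doc_id)
        count += 1
        if count >= max_postings_per_block:
            yield dict(sorted(block.items()))  # terms sorted once, when the block is written
            block, count = defaultdict(list), 0
    if block:
        yield dict(sorted(block.items()))

def merge_blocks(blocks):
    # Merge the separate block indexes into one big index.
    final = defaultdict(list)
    for block in blocks:
        for term, plist in block.items():
            final[term].extend(plist)
    return dict(final)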

Distributed indexing

- For web-scale indexing, we must use a distributed computing cluster
- Individual machines are fault-prone: they can unpredictably slow down or fail
- How do we exploit such a pool of machines?
  - Maintain a master machine directing the indexing job (considered "safe")
  - Break up indexing into sets of (parallel) tasks
  - The master machine assigns each task to an idle machine from a pool

Parallel tasks

- Use two sets of parallel tasks: parsers and inverters
- Break the input document corpus into splits
  - Each split is a subset of documents (e.g., corresponding to blocks in BSBI)
- The master assigns a split to an idle parser machine
- A parser reads a document at a time and emits (term, doc) pairs
  - it writes the pairs into j partitions; each partition covers a range of terms' first letters (e.g., a-f, g-p, q-z), so here j = 3
- An inverter collects all (term, doc) pairs for one partition, sorts them, and writes the postings lists

Data flow

[MapReduce-style data flow diagram (Sec. 4.4): the master assigns splits to parsers (map phase); each parser writes its (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z); in the reduce phase, each inverter collects one partition across all segment files and produces the postings for that term range.]

Dynamic indexing

- Problem:
  - Docs come in over time
    - postings updates for terms already in the dictionary
    - new terms added to the dictionary
  - Docs get deleted
- Simplest approach:
  - Maintain a "big" main index
  - New docs go into a "small" auxiliary index
  - Search across both, merge results
  - Deletions: keep an invalidation bit-vector for deleted docs and filter the documents output by a search through this bit-vector
  - Periodically, re-index into one main index
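A toy sketch of this main-plus-auxiliary scheme, with a set of deleted doc IDs standing in for the invalidation bit-vector; the class and method names are illustrative:

class DynamicIndex:
    def __init__(self, main_index):
        self.main = main_index   # "big" main index, rebuilt only periodically
        self.aux = {}            # "small" auxiliary index for newly arriving docs
        self.deleted = set()     # invalidation set for deleted docs

    def add_document(self, doc_id, terms):
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def postings(self, term):
        # Search across both indexes, then filter through the invalidation set.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.deleted]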

Index on disk vs. memory

- Most retrieval systems keep the dictionary in memory and the postings on disk
- Web search engines frequently keep both in memory
  - massive memory requirement
  - feasible for large web service installations
  - less so for commercial usage where query loads are lighter

Retrieval From Indexes

- Given the large indexes in IR applications, searching for keys in the dictionaries becomes a dominant cost
- Two main choices for dictionary data structures: hashtables or trees
  - Using hashing
    - requires the derivation of a hash function mapping terms to locations
    - may require collision detection and resolution for non-unique hash values
  - Using trees
    - binary search trees: nice properties, easy to implement, and effective
    - enhancements such as B+ trees can improve search efficiency
    - but this requires the storage of keys in each internal node

Hashtables

- Each vocabulary term is hashed to an integer (we assume you've seen hashtables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything

Sec. 3.1

Trees

- Simplest: binary tree; more usual: B-trees
- Trees require a standard ordering of characters and hence of strings ... but we typically have one
- Pros:
  - Solves the prefix problem (e.g., terms starting with hyp); see the sketch below
- Cons:
  - Slower: O(log M) [and this requires a balanced tree]
  - Rebalancing binary trees is expensive, but B-trees mitigate the rebalancing problem

Sec. 3.1
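One way to see why an ordered dictionary supports prefix search while a hashtable does not: when the vocabulary is kept in sorted order, all terms sharing a prefix form one contiguous range. A small sketch using binary search on a sorted array (an illustration of the ordering property, not the B-tree implementation itself):

import bisect

vocab = sorted(["aardvark", "huygens", "hyperlink", "hypothesis", "sickle", "zygot"])

def prefix_search(prefix):
    # All vocabulary terms starting with `prefix` form a contiguous slice.
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, prefix + "\uffff")  # just past the last possible match
    return vocab[lo:hi]

print(prefix_search("hyp"))  # ['hyperlink', 'hypothesis']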

Tree: binary tree

[Binary tree example (Sec. 3.1): the root splits the vocabulary into a-m and n-z; the next level splits into a-hu, hy-m, n-sh, and si-z; the leaves hold terms such as aardvark, huygens, sickle, and zygot.]

Tree: B-tree

- Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4]

[B-tree example (Sec. 3.1): an internal node whose children cover the ranges a-hu, hy-m, and n-z.]

Recall: Steps in Basic Automatic Indexing

- Parse documents to recognize structure
- Scan for word tokens
- Stopword removal
- Stem words
- Weight words

Indexing Models (aka "Term Weighting")

- Basic issue: which terms should be used to index a document, and how much should each count?
- Some approaches:
  - Binary weights: terms either appear or they don't; no frequency information is used
  - Term frequency: either raw term counts or (more often) counts normalized by document length
  - TF.IDF (inverse document frequency model)
  - Term discrimination model
  - Signal-to-noise ratio (based on information theory)
  - Probabilistic term weights

Binary Weights

- Only the presence (1) or absence (0) of a term is included in the vector

  docs  t1  t2  t3
  D1    1   0   1
  D2    1   0   0
  D3    0   1   1
  D4    1   0   0
  D5    1   1   1
  D6    1   1   0
  D7    0   1   0
  D8    0   1   0
  D9    0   0   1
  D10   0   1   1
  D11   1   0   1

This representation can be particularly useful, since the documents (and the query) can be viewed as simple bit strings. This allows query operations to be performed using logical bit operations.

Binary Weights: Matching of Documents & Queries

- In the case of binary weights, matching between documents and queries can be seen as the size of the intersection of two sets (of terms): |Q ∩ D|. This in turn can be used to rank the relevance of documents to a query.

  docs  t1  t2  t3  Rank = Q·Di
  D1    1   0   1   2
  D2    1   0   0   1
  D3    0   1   1   2
  D4    1   0   0   1
  D5    1   1   1   3
  D6    1   1   0   2
  D7    0   1   0   1
  D8    0   1   0   1
  D9    0   0   1   1
  D10   0   1   1   2
  D11   1   0   1   2
  Q     1   1   1

Beyond Binary Weights

- More generally, the similarity between the query and a document can be seen as the dot product of the two vectors: Q · D (this is also called simple matching)
- Note that if both Q and D are binary, this is the same as |Q ∩ D|

Given two vectors $X = \langle x_1, x_2, \ldots, x_n \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$, simple matching measures the similarity between X and Y as their dot product:

  $sim(X, Y) = X \cdot Y = \sum_i x_i y_i$

  docs  t1  t2  t3  Rank = Q·Di
  D1    1   0   1   4
  D2    1   0   0   1
  D3    0   1   1   5
  D4    1   0   0   1
  D5    1   1   1   6
  D6    1   1   0   3
  D7    0   1   0   2
  D8    0   1   0   2
  D9    0   0   1   3
  D10   0   1   1   5
  D11   1   0   1   4
  Q     1   2   3

Raw Term Weights

- The frequency of occurrence of the term in each document is included in the vector
- Now the notion of simple matching (dot product) incorporates the term weights from both the query and the documents
- Using raw term weights provides the ability to better distinguish among retrieved documents

  docs  t1  t2  t3  RSV = Q·Di
  D1    2   0   3   11
  D2    1   0   0   1
  D3    0   4   7   29
  D4    3   0   0   3
  D5    1   6   3   22
  D6    3   5   0   13
  D7    0   8   0   16
  D8    0   10  0   20
  D9    0   0   1   3
  D10   0   3   5   21
  D11   4   0   1   7
  Q     1   2   3

Note: although "term frequency" is commonly used to mean the raw occurrence count, technically it implies that the raw count is divided by the document length (the total number of term occurrences in the document).

Term Weights: TF

- More frequent terms in a document are more important, i.e., more indicative of the topic
  - $f_{ij}$ = frequency of term i in document j
- May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
  - $tf_{ij} = f_{ij} / \max_k \{ f_{kj} \}$
- Or use sublinear tf scaling:
  - $tf_{ij} = 1 + \log f_{ij}$ (for $f_{ij} > 0$)
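The two tf variants above as small helper functions (the log base is a convention; base 2 or base 10 both appear in practice):

import math

def tf_max_normalized(f_ij, max_f_in_doc):
    # tf normalized by the frequency of the most common term in the same document
    return f_ij / max_f_in_doc if max_f_in_doc else 0.0

def tf_sublinear(f_ij):
    # sublinear tf scaling: 1 + log f for f > 0, and 0 otherwise
    return 1.0 + math.log(f_ij) if f_ij > 0 else 0.0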

Normalized Similarity Measures

- With or without normalized weights, it is possible to incorporate normalization into various similarity measures
- Example (Vector Space Model):
  - in simple matching, the dot product of two vectors measures the similarity of these vectors
  - normalization can be achieved by dividing the dot product by the product of the norms of the two vectors
  - given a vector $X = \langle x_1, x_2, \ldots, x_n \rangle$, the norm of X is:

    $\|X\| = \sqrt{\sum_i x_i^2}$

  - the similarity of vectors X and Y is then:

    $sim(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$

Note: this measures the cosine of the angle between the two vectors; it is thus called the normalized cosine similarity measure.
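The normalized cosine measure, written out directly from the formula above:

import math

def cosine_similarity(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# D1 = (2, 0, 3) and Q = (1, 2, 3) from the example on the next slide
print(round(cosine_similarity((2, 0, 3), (1, 2, 3)), 2))  # 0.82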

Normalized Similarity Measures

The table shows both the normalized cosine similarity SIM(Q,Di) and the raw dot product RSV = Q·Di for the same documents. Note that the relative ranking among documents has changed!

  docs  t1  t2  t3  SIM(Q,Di)  RSV = Q·Di
  D1    2   0   3   0.82       11
  D2    1   0   0   0.27       1
  D3    0   4   7   0.96       29
  D4    3   0   0   0.27       3
  D5    1   6   3   0.87       22
  D6    3   5   0   0.60       13
  D7    0   8   0   0.53       16
  D8    0   10  0   0.53       20
  D9    0   0   1   0.80       3
  D10   0   3   5   0.96       21
  D11   4   0   1   0.45       7
  Q     1   2   3

tf x idf Weighting

- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf): a way to deal with the problems of the Zipf distribution (recall the Zipf distribution)
- Want to weight terms highly if they are:
  - frequent in relevant documents ... BUT
  - infrequent in the collection as a whole
- Goal: assign a tf x idf weight to each term in each document

tf x idf

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

where:
  $T_k$ = term k
  $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
  $idf_k$ = inverse document frequency of term $T_k$ in collection C
  $N$ = total number of documents in collection C
  $n_k$ = number of documents in C that contain $T_k$
  $idf_k = \log(N / n_k)$
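A direct transcription of the weighting formula; base-10 logs are assumed here so that the numbers match the IDF example on the next slide:

import math

def idf(N, n_k):
    # idf_k = log(N / n_k)
    return math.log10(N / n_k)

def tf_idf(tf_ik, N, n_k):
    # w_ik = tf_ik * log(N / n_k)
    return tf_ik * idf(N, n_k)

for n_k in (1, 20, 5000, 10000):
    print(n_k, round(idf(10_000, n_k), 3))  # 4.0, 2.699, 0.301, 0.0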

Inverse Document Frequency

- IDF provides high values for rare words and low values for common words

Example, for a collection of 10,000 documents:

  $\log(10000 / 1) = 4$
  $\log(10000 / 20) = 2.699$
  $\log(10000 / 5000) = 0.301$
  $\log(10000 / 10000) = 0$
tf x idf normalization

- Normalize the term weights (so longer documents are not unfairly given more weight)
  - "normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1, inclusive
  - this is more ad hoc than normalization based on vector norms, but the basic idea is the same:

$w_{ik} = \frac{tf_{ik} \, \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$

tf x idf Example

The initial Term x Doc matrix (inverted index), with documents represented as vectors of words:

       Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6   df   idf = log2(N/df)
  T1   0      2      4      0      1      0       3    1.00
  T2   1      3      0      0      0      2       3    1.00
  T3   0      1      0      2      0      0       2    1.58
  T4   3      0      1      5      4      0       4    0.58
  T5   0      4      0      0      0      1       2    1.58
  T6   2      7      2      1      3      0       5    0.26
  T7   1      0      0      5      5      1       4    0.58
  T8   0      1      1      0      0      3       3    1.00

The tf x idf Term x Doc matrix:

       Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
  T1   0.00   2.00   4.00   0.00   1.00   0.00
  T2   1.00   3.00   0.00   0.00   0.00   2.00
  T3   0.00   1.58   0.00   3.17   0.00   0.00
  T4   1.75   0.00   0.58   2.92   2.34   0.00
  T5   0.00   6.34   0.00   0.00   0.00   1.58
  T6   0.53   1.84   0.53   0.26   0.79   0.00
  T7   0.58   0.00   0.00   2.92   2.92   0.58
  T8   0.00   1.00   1.00   0.00   0.00   3.00

Alternative TF.IDF Weighting Schemes

- Many search engines allow for different weightings for queries vs. documents
- A very standard weighting scheme is:
  - Document: logarithmic tf, no idf, and cosine normalization
  - Query: logarithmic tf, idf, no normalization

Keyword Discrimination Model

- The vector representation of documents can be used as the source of another approach to term weighting
  - Question: what happens if we remove one of the words used as dimensions in the vector space?
  - If the average similarity among documents changes significantly, then the word was a good discriminator
  - If there is little change, the word is not as helpful and should be weighted less
- Note that the goal is to have a representation that makes it easier for queries to discriminate among documents
- Average similarity can be measured after removing each word from the matrix
  - Any of the similarity measures can be used (we will look at a variety of other similarity measures later)

Keyword Discrimination

- Measuring average similarity (assume there are N documents):

  $\text{AVG-SIM} = \frac{1}{N^2} \sum_{i,j} sim(D_i, D_j)$

  where $sim(D_1, D_2)$ is the similarity score for the pair of documents $D_1$ and $D_2$. Computing all pairs is computationally expensive.

  $sim_k$ = AVG-SIM recomputed with term k removed

- A better way to calculate AVG-SIM:
  - Calculate the centroid $D^*$ (the average document vector = sum of vectors / N)
  - Then:

    $\text{AVG-SIM} = \frac{1}{N} \sum_i sim(D_i, D^*)$

Keyword Discrimination

- Discrimination value (discriminant) for term k:

  $disc_k = sim_k - sim$

  - $disc_k > 0$ ==> term k is a good discriminator
  - $disc_k < 0$ ==> term k is a poor discriminator
  - $disc_k = 0$ ==> term k is indifferent

- Computing term weights: the new weight for term k in document i is the original term frequency of k in i times the discriminant value:

  $w_{ik} = tf_{ik} \cdot disc_k$
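A sketch that reproduces the centroid-based computation used in the example that follows (cosine similarity, with the centroid recomputed after each term is removed); it is meant only to make the definitions concrete:

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def avg_sim(docs):
    # AVG-SIM = (1/N) * sum_i sim(D_i, D*), with D* the centroid vector
    n, dims = len(docs), len(docs[0])
    centroid = [sum(d[j] for d in docs) / n for j in range(dims)]
    return sum(cosine(d, centroid) for d in docs) / n

def discrimination_values(docs):
    # disc_k = (AVG-SIM with term k removed) - (AVG-SIM with all terms)
    base = avg_sim(docs)
    discs = []
    for k in range(len(docs[0])):
        reduced = [[w for j, w in enumerate(d) if j != k] for d in docs]
        discs.append(avg_sim(reduced) - base)
    return discs

docs = [[10, 1, 0], [9, 2, 10], [8, 1, 1], [8, 1, 50], [19, 2, 15], [9, 2, 0]]
print([round(d, 3) for d in discrimination_values(docs)])  # approx. [-0.139, -0.001, 0.191]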

Keyword Discrimination - Example

Using normalized cosine similarity, with $sim_k$ = AVG-SIM with term k removed and $\text{AVG-SIM} = \frac{1}{N} \sum_i sim(D_i, D^*)$:

  docs  t1     t2    t3
  D1    10     1     0
  D2    9      2     10
  D3    8      1     1
  D4    8      1     50
  D5    19     2     15
  D6    9      2     0
  D*    10.50  1.50  12.67

Similarity of each document to the centroid:

  sim(D1,D*)  0.641
  sim(D2,D*)  0.998
  sim(D3,D*)  0.731
  sim(D4,D*)  0.859
  sim(D5,D*)  0.978
  sim(D6,D*)  0.640
  AVG-SIM     0.808

With term k removed (note: D* for each SIM_k is now computed with only two terms):

         t1 removed  t2 removed  t3 removed
  D1     0.118       0.638       0.999
  D2     0.997       0.999       0.997
  D3     0.785       0.729       1.000
  D4     0.995       0.861       1.000
  D5     1.000       0.978       0.999
  D6     0.118       0.638       0.997
  SIM_k  0.669       0.807       0.999

Keyword Discrimination - Example

Discrimination values, $disc_k = sim_k - sim$:

  Term  disc_k
  t1    -0.139
  t2    -0.001
  t3     0.191

New weights, $w_{ik} = tf_{ik} \cdot disc_k$, for terms t1, t2, and t3:

        t1      t2      t3
  D1    -1.392  -0.001  0.000
  D2    -1.253  -0.001  1.908
  D3    -1.114  -0.001  0.191
  D4    -1.114  -0.001  9.538
  D5    -2.645  -0.001  2.861
  D6    -1.253  -0.001  0.000

This shows that t1 tends to be a poor discriminator, while t3 is a good discriminator. The new term weights now reflect the discrimination value of these terms. Note that further normalization can be done to make all term weights positive.

Signal-To-Noise Ratio

- Based on the work of Shannon in the 1940's on information theory
  - Developed a model of communication of messages across a noisy channel
  - The goal is to devise an encoding of messages that is most robust in the face of channel noise
- In IR, messages describe the content of documents
  - The amount of information a word contributes about a document is inversely proportional to its probability of occurrence
  - The least informative words are those that occur approximately uniformly across the corpus of documents
    - a word that occurs with similar frequency across many documents (e.g., "the", "and", etc.) is less informative than one that occurs with high frequency in one or two documents
- Shannon used entropy (a logarithmic measure) to measure average information gain, with noise defined as its inverse

Signal-To-Noise Ratio

$p_k$ = Prob(term k occurs in document i) = $tf_{ik} / tf_k$

  $\text{Info}_k = -p_k \log_2 p_k$
  $\text{Noise}_k = -p_k \log_2 (1/p_k)$

Summed over all documents:

  $\text{AVG-INFO}_k = -\sum_i \frac{tf_{ik}}{tf_k} \log_2 \frac{tf_{ik}}{tf_k}$

  $\text{NOISE}_k = -\sum_i \frac{tf_{ik}}{tf_k} \log_2 \frac{tf_k}{tf_{ik}} = -\text{AVG-INFO}_k$

  $\text{SIGNAL}_k = \log_2(tf_k) - \text{AVG-INFO}_k = \log_2(tf_k) + \text{NOISE}_k$

  $w_{ik} = tf_{ik} \cdot \text{SIGNAL}_k$  (the weight of term k in document i)

Notes: here we always take logs to be base 2. NOISE is the negation of AVG-INFO, so only one of these needs to be computed in practice.
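The same quantities as a small function, following the definitions above (base-2 logs, with Info = 0 for documents in which the term does not appear):

import math

def signal_weights(doc_term_freqs):
    # doc_term_freqs: one frequency vector per document (columns are terms)
    n_docs, n_terms = len(doc_term_freqs), len(doc_term_freqs[0])
    weights = [[0.0] * n_terms for _ in range(n_docs)]
    for k in range(n_terms):
        col = [d[k] for d in doc_term_freqs]
        tf_k = sum(col)
        # AVG-INFO_k = -sum_i p_ik * log2(p_ik), skipping documents where tf_ik = 0
        avg_info = -sum((f / tf_k) * math.log2(f / tf_k) for f in col if f > 0)
        signal_k = math.log2(tf_k) - avg_info
        for i, f in enumerate(col):
            weights[i][k] = f * signal_k  # w_ik = tf_ik * SIGNAL_k
    return weights

docs = [[10, 1, 1], [9, 2, 10], [8, 1, 1], [8, 1, 50], [19, 2, 15], [9, 2, 1]]
print([round(w, 2) for w in signal_weights(docs)[0]])  # approx. [34.76, 0.67, 4.79]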

Signal-To-Noise Ratio - Example

$p_k = tf_{ik} / tf_k$; AVG-INFO$_k$ is the "entropy" of term k in the collection. Note: by definition, if term k does not appear in a document, we assume Info(k) = 0 for that document.

  docs  t1  t2  t3
  D1    10  1   1
  D2    9   2   10
  D3    8   1   1
  D4    8   1   50
  D5    19  2   15
  D6    9   2   1
  tf_k  63  9   78

  docs      Prob t1  Prob t2  Prob t3   Info(t1)  Info(t2)  Info(t3)
  D1        0.159    0.111    0.013     0.421     0.352     0.081
  D2        0.143    0.222    0.128     0.401     0.482     0.380
  D3        0.127    0.111    0.013     0.378     0.352     0.081
  D4        0.127    0.111    0.641     0.378     0.352     0.411
  D5        0.302    0.222    0.192     0.522     0.482     0.457
  D6        0.143    0.222    0.013     0.401     0.482     0.081
  AVG-INFO                              2.501     2.503     1.490

Signal-To-Noise Ratio - Example

With $\text{NOISE}_k = -\text{AVG-INFO}_k$, $\text{SIGNAL}_k = \log_2(tf_k) - \text{AVG-INFO}_k$, and $w_{ik} = tf_{ik} \cdot \text{SIGNAL}_k$ (the weight of term k in document i):

  Term  AVG-INFO  NOISE   SIGNAL
  t1    2.501     -2.501  3.476
  t2    2.503     -2.503  0.667
  t3    1.490     -1.490  4.795

  docs  Weight t1  Weight t2  Weight t3
  D1    34.760     0.667      4.795
  D2    31.284     1.333      47.951
  D3    27.808     0.667      4.795
  D4    27.808     0.667      239.753
  D5    66.044     1.333      71.926
  D6    31.284     1.333      4.795

Additional normalization can be performed to bring the values into the range [0,1].

Probabilistic Term Weights

- The probabilistic model makes explicit distinctions between occurrences of terms in relevant and non-relevant documents
- If we know
  - $p_i$: probability that term $x_i$ appears in a relevant document
  - $q_i$: probability that term $x_i$ appears in a non-relevant document

  then, with the binary and independence assumptions, the weight of term $x_i$ in document $D_k$ is:

  $w_{ik} = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$

- Estimating $p_i$ and $q_i$ requires relevance information:
  - using test queries and test collections to "train" the values of $p_i$ and $q_i$
  - other AI/learning techniques?
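The weight formula as a one-line helper (natural log used here; changing the base only rescales the weights). The 0.8/0.1 figures below are made-up illustrative probabilities, not values from the slides:

import math

def prob_term_weight(p_i, q_i):
    # w_i = log( p_i * (1 - q_i) / ( q_i * (1 - p_i) ) )
    return math.log((p_i * (1 - q_i)) / (q_i * (1 - p_i)))

# e.g., a term appearing in 80% of relevant docs but only 10% of non-relevant docs
print(round(prob_term_weight(0.8, 0.1), 2))  # approx. 3.58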

Phrase Indexing

- Both statistical and syntactic methods have been used to identify "good" phrases
- Proven techniques include finding all word pairs that occur more than n times in the corpus, or using a part-of-speech tagger to identify simple noun phrases
- Phrases can have an impact on effectiveness and efficiency
  - phrase indexing will speed up phrase queries
  - phrases improve precision by disambiguating word senses: e.g., "grass field" vs. "magnetic field"
  - the effect on effectiveness is not straightforward and depends on the retrieval model: e.g., for "information retrieval", how much do the individual words count?

Associating Weights with Phrases

- Typical approach (Salton and McGill, 1983):
  - Compute pairwise co-occurrence for high-frequency words
  - If the co-occurrence value is less than some threshold a, do not consider the pair any further
  - For qualifying pairs of terms $(t_i, t_j)$, compute the cohesion value, either

    $cohesion(t_i, t_j) = s \cdot \frac{freq(t_i, t_j)}{totfreq(t_i) \cdot totfreq(t_j)}$   (Salton and McGill, 1983)

    where s is a size factor determined by the size of the vocabulary, or

    $cohesion(t_i, t_j) = \frac{freq(t_i, t_j)}{freq(t_i) \cdot freq(t_j)}$   (Rada, 1986)

  - If the cohesion is above a threshold b, retain the phrase as a valid index phrase
- The weight of the phrase in the index will be a function of its cohesion value
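A rough sketch of the pairwise-cohesion computation; it counts co-occurrence within whole documents (a window or sentence could be used instead), uses the Salton and McGill form of the formula, and leaves the thresholds a and b to the caller:

from collections import Counter
from itertools import combinations

def cohesion_scores(documents, size_factor=1.0):
    term_freq, pair_freq = Counter(), Counter()
    for text in documents:
        terms = text.split()
        term_freq.update(terms)
        # count a pair once per document in which both terms occur
        pair_freq.update({tuple(sorted(p)) for p in combinations(set(terms), 2)})
    return {
        (ti, tj): size_factor * c / (term_freq[ti] * term_freq[tj])
        for (ti, tj), c in pair_freq.items()
    }

docs = ["magnetic field sensor", "magnetic field strength", "grass field"]
print(cohesion_scores(docs)[("field", "magnetic")])  # cohesion of the candidate phrase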

Concept Indexing

- More complex indexing could include concept or thesaurus classes
  - One approach is to use a controlled vocabulary (or subject codes) and map specific terms to "concept classes"
  - Automatic concept generation can use classification or clustering to determine concept classes
- Automatic concept indexing
  - Words, phrases, synonyms, and linguistic relations can all be evidence used to infer the presence of a concept
    - e.g., the concept "automobile" can be inferred based on the presence of the words "vehicle", "transportation", "driving", etc.
  - One approach is to represent each word as a "concept vector"
    - each dimension represents a weight for a concept associated with the term
    - phrases or other index items can be represented as weighted averages of the concept vectors of the terms in them
  - Another approach: Latent Semantic Indexing (LSI)

Next

- Retrieval Models and Ranking Algorithms
  - Boolean Matching and Boolean Queries
  - Vector Space Model and Similarity Ranking
  - Extended Boolean Models
  - Basic Probabilistic Models
  - Implementation Issues for Ranking Systems