81
Basic IR Models Vector Space Model Probabilistic models Solutions Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81

Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Chap 2: Classical models for information retrieval

Jean-Pierre Chevallet & Philippe Mulhem

LIG-MRIM

Sept 2016

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81

Page 2: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 2 / 81

Page 3: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

IR System

Request! Documents!

Representation !of queries!

(query language)!

Representation of !the content documents!

(indexing language)!

Implementation of Matching

function!

Query! Surrogate!

Information Retrieval System (IRS)!

Knowledge!

Base!

Query model! Document model!(content)!

Matching!function!

Information Retrieval Model!

Knowledge model!

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 3 / 81

Page 4: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

IR Models

Classical Models

Boolean Set Vector Space Probabilistic

Strict Weighted Classical Inference Network

Language Model

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 4 / 81

Page 5: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 5 / 81

Page 6: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Set Model

Membership of a set

Requests are single descriptors

Indexing assignments : a set of descriptors

In reply to a request, documents are either retrieved or not(no ordering)

Retrieval rule : if the descriptor in the request is a member ofthe descriptors assigned to a document, then the document isretrieved.

This model is the simplest one and describes the retrievalcharacteristics of a typical library where books are retrieved bylooking up a single author, title or subject descriptor in a catalog.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 6 / 81

Page 7: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Example

Request

”information retrieval”

Doc1

{”information retrieval”, ”database” , ”salton”}−− > RETRIEVED < −−

Doc2

{ ”database” , ”SQL”}−− > NOT RETRIEVED < −−

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 7 / 81

Page 8: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Set inclusion

Request: a set of descriptors

Indexing assignments : a set of descriptors

Documents are either retrieved or not (no ordering)

Retrieval rule : document is retrieved if ALL the descriptors inthe request are in the indexing set of the document.

This model uses the notion of inclusion of the descriptor set of therequest in the descriptor set of the document

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 8 / 81

Page 9: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Set intersection

Request: a set of descriptors PLUS a cut off value

Indexing assignments : a set of descriptors

Documents are either retrieved or not (no ordering)

Retrieval rule : document is retrieved if it shares a number ofdescriptors with the request that exceeds the cut-off value

This model uses the notion of set intersection between thedescriptor set of the request with the descriptor set of thedocument.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 9 / 81

Page 10: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Set intersection plus ranking

Request: a set of descriptors PLUS a cut off value

Indexing assignments : a set of descriptors

Retrieved documents are ranked

Retrieval rule : documents showing with the request more than thespecified number of descriptors are ranked in order of decreasingoverlap

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 10 / 81

Page 11: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 11 / 81

Page 12: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Logical Models

Based on a given formalized logic:

Propositional Logic: boolean model

First Order Logic : Conceptual Graph Matching

Modal Logic

Description Logic: matching with knowledge

Concept Analysis

Fuzzy logic

...

Matching : a deduction from the query Q to the document D.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 12 / 81

Page 13: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Boolean Model

Request is any boolean combination of descriptors using theoperators AND, OR and NOT

Indexing assignments : a set of descriptors

Retrieved documents are retrieved or not

Retrieval rules:

if Request = ta ∧ tb then retrieve only documents with bothta and tb

if Request = ta ∨ tb then retrieve only documents with eitherta or tb

if Request = ¬ta then retrieve only documents without ta.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 13 / 81

Page 14: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Boolean Model

Knowledge Model : T = {ti}, i ∈ [1, ..N]

Term ti that index the documents

The document model (content) I a Boolean expression in theproposition logic, with the ti considered as propositions:

A document D1 = {t1, t3} is represented by the logic formulaas a conjunction of all terms direct (in the set) or negated.t1 ∧ ¬t2 ∧ t3 ∧ ¬t4 ∧ ... ∧ ¬tN−1 ∧ ¬tNA query Q is represented by any logic formula

The matching function is the logical implication: D |= Q

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 14 / 81

Page 15: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Boolean Model: relevance value

No distinction between relevant documents:

Q = t1 ∧ t2 over the vocabulary {t1, t2, t3, t4, t5, t6}D1 = {t1, t2} ≡ t1 ∧ t2 ∧ ¬t3 ∧ ¬t4 ∧ ¬t5 ∧ ¬t6

D2 = {t1, t2, t3, t4, t5} ≡ t1 ∧ t2 ∧ t3 ∧ t4 ∧ t5 ∧ ¬t6

Both documents are relevant because : D1 ⊃ Q and D2 ⊃ Q.We ”feel” that D1 is a better response because ”closer” to thequery.Possible solution : D ⊃ Q and Q ⊃ D. (See chapter on LogicalModels)

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 15 / 81

Page 16: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Boolean Model: complexity of queries

Q = ((t1 ∧ t2) ∨ t3) ∧ (t4 ∨ ¬(¬t5 ∧ t6))

Meaning of the logical ∨ (inclusive) different from the usual ”or”(exclusive)

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 16 / 81

Page 17: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Conclusion for boolean model

Model use till the the 1990.

Very simple to implement

Still used is some web search engine as a basic document fastfilter with an extra step for document ordering.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 17 / 81

Page 18: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 18 / 81

Page 19: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Beyond boolean logic: choice of a logic

A lot of possible choices to go beyond the boolean model. Thechoice of the underlying logic L to models the IR matchingprocess:

Propositional Logic

First order logic

Modal Logic

Fuzzy Logic

Description logic

....

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 19 / 81

Page 20: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Beyond boolean logic: matching interpretation

For a logic L, different possible interpretations of matching:

Deduction theory

D `L Q

`L D ⊃ Q

Model theory

D |=L Q

|=L D ⊃ Q

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 20 / 81

Page 21: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Interpretations

Documents answer to q=t1"t3

1011

0000

0001 0010 0011

0100 0110

0111

1001

Document d = {t1, t2, t3}

1010

1111 1110

1100 1101

1000

0101

Other Documents

Other interpretations

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 21 / 81

Page 22: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Inverted File

a " b

a b c d !

0 1 0 0 1 1 0 1 !! 1 1 1 1 0 1 0 0 !! 0 0 0 1 0 0 0 0 !!

d1 d2 d3 !.

0 1 0 0 0 1 0 0 !..

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 22 / 81

Page 23: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Implementing the boolean model

Posting list:

List of unique doc id associates with an indexing term

If no ordering, this list is in fact a set

If ordering, nice algorithm to fuse lists

Main set (list) fusion algorithms:

Intersection, usion and complement

Algorithm depend on order.

If no order complexity is O(n ∗m)

if order, complexity is O(n + m) ! So maintain posting listordered.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 23 / 81

Page 24: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Implementing the boolean model

Left to exercise : produces the 3 algorithms, of union, intersectionand complement, with the following constraints:

The two lists are read only

The produced list is e new one and is write only

Using sorting algorithms is not allowed

Deduce from the algorithms the compexity.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 24 / 81

Page 25: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 25 / 81

Page 26: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Weighted Boolean Model

Extension of Boolean Model with weights.Weight denoting the representativity of a term for a document).Knowledge Model : T = {ti}, i ∈ [1, ..N]Terms ti that index the documents.

A document D is represented by

A logical formula D (similar to Boolean Model)

A function WD : T → [0, 1], which gives, for each term in Tthe weight of the term in D. The weight is 0 for a term notpresent in the document.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 26 / 81

Page 27: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Weighted Boolean Model: matching

Non binary matching function based on Fuzzy logic

RSV (D, a ∨ b) = Max [WD(a),WD(b)]

RSV (D, a ∧ b) = Min[WD(a),WD(b)]

RSV (D,¬a) = 1−WD(a)

Limitation : this matching does not take all the query termsinto account.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 27 / 81

Page 28: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Weighted Boolean Model: matching

Non binary matching function based on a similarity function whichtake more into account all the query terms.

RSV (D, a ∨ b) =

√WD(a)2+WD(b)2

2

RSV (D, a ∧ b) = 1−√

(1−WD(a))2+(1−WD(b))2

2

RSV (D,¬a) = 1−WD(a)

Limitation : query expression for complex needs

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 28 / 81

Page 29: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Weighted Boolean Model: matching example

0#0#0#0#0#0#D4

1- 1/$2#1/$2 0#1#1#0#D3

1- 1/$2=0.29 1/$2=0.71#0#1#0#1#D2

1 1 1#1#1#1#D1

a ! b a " b a ! b a " b b a Boolean Weighted Boolean

Document Query

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 29 / 81

Page 30: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Weighted Boolean Model: matching example

d1=(0.2, 0.4) , d2=(0.6, 0.9), d3=(0,1)

(0,0) b b

a

(0,0)

(1,1)

d1 d1 d2 d2

0.4

0.6

0.2 0.9 0.2 0.9

a or b

a and b d3 0.71

To be closer to (1,1) To avoid (0,0) (1,1)

d3

0.29

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 30 / 81

Page 31: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 31 / 81

Page 32: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Set modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

Upper bounds for results

Let ti indexing term, let #ti the size of its posting list and let |Q|the number of documents that answer the boolean query Q.

1 Propose an upper bound value in term of posting list size forthe queries t1 ∨ t2 and t1 ∧ t2

2 If we suppose that #t1 > #t2 > #t3, compute the upperbound value of |t1 ∧ t2 ∧ t3|, then propose a way to reduceglobal computation of posting list merge1.

3 If we suppose that #t1 > #t2 > #t3, compute the upperbound value of |t1 ∨ t2| and of |(t1 ∨ t2) ∧ t3|, then propose away to reduce global computation of posting list merge ofquery (t1 ∨ t2) ∧ t3

1Use the complexity value of merge depending of list sizesJean-Pierre Chevallet & Philippe Mulhem Models of IR 32 / 81

Page 33: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 33 / 81

Page 34: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Vector Space Model

Request: a set of descriptors each of which has a positivenumber associated with it

Indexing assignments : a set of descriptors each of which hasa positive number associated with it.

Retrieved documents are ranked

Retrieval rule : the weights of the descriptors common to therequest and to indexing records are treated as vectors. The valueof a retrieved document is the cosine of the angle between thedocument vector and the query vector.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 34 / 81

Page 35: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Vector Space Model

Knowledge model: T = {ti}, i ∈ [1, .., n]All documents are described using this vocabulary.A document Di is represented by a vector di described in the Rn

vector space defined on Tdi = (wi ,1,wi ,2, ...,wi ,j , ...,wi ,n), with wk,l the weight of a term tlfor a document.A query Q is represented by a vector q described in the samevector space q = (wQ,1,wQ,2, ...,wQ,j , ...,wQ,n)

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 35 / 81

Page 36: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Vector Space Model

The more two vectors that represent documents are ?near?, themore the documents are similar:

Di

Term 1

Term 3

Term 2

Dj

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 36 / 81

Page 37: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Vector Space Model: matching

Relevance: is related to a vector similarity.RSV (D,Q) =def SIM( ~D, ~Q)

Symmetry: SIM( ~D, ~Q) = SIM( ~Q, ~D)

Normalization: SIM : V → [min,max ]

Reflectivity : SIM(~X , ~X ) = max

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 37 / 81

Page 38: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 38 / 81

Page 39: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Vector Space Model: weighting

Based on counting the more frequent words, and also the moresignificant ones.

From Luhn 58

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 39 / 81

Page 40: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Zipf Law

Rank r and frequency f :

r × f = const

f1

fr

t1 tr Jean-Pierre Chevallet & Philippe Mulhem Models of IR 40 / 81

Page 41: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Heap Law

Estimation of the vocabulary size |V | according to the size |C | ofthe corpus:

|V | = K × |C |β

for English : K ∈ [10, 100] and β ∈ [0.4, 0.6] (e.g., |V | ≈ 39 000for |C |=600 000, K=50, β=0.5)

T

V

Ti Jean-Pierre Chevallet & Philippe Mulhem Models of IR 41 / 81

Page 42: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Term Frequency

Term Frequency

The frequency tfi ,j of the term tj in the document Di equals to thenumber of occurrences of tj in Di .

Considering a whole corpus (document database) into account, aterm that occurs a lot does not discriminate documents:

Term frequent in the whole corpus

Term frequent in only one document of the corpus

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 42 / 81

Page 43: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Document Frequency: tf.idf

Document Frequency

The document frequency dfj of a term tj is the number ofdocuments in which tj occurs.

The larger the df , the worse the term for an IR point of view... so,we use very often the inverse document frequency idfj :

Inverse Document Frequency

idfj = 1dfj

idfj = log( |C |dfj) with |C | is the size of the corpus, i.e. the number of

documents.

The classical combination : wi ,j = tfi ,j × idfj .

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 43 / 81

Page 44: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

matching

Matching function: based on the angle between the query vector ~Qand the document vector ~Di .The smaller the angle the more the document matches the query.

Di

Term 1

Term 3

Term 2

Q

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 44 / 81

Page 45: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

matching: cosine

One solution is to use the cosine the angle between the queryvector and the document vector.

Cosine

SIMcos( ~Di , ~Q) =

n∑k=1

wi ,k × wQ,k√√√√√√n∑

k=1

(wi ,k)2 ×n∑

k=1

(wQ,k)2

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 45 / 81

Page 46: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

matching: set interpretation

When binary discrete values, on have a set interpretations D{}i and

Q{}:

SIMcos( ~Di , ~Q) =|D{}i ∩ Q{}|√|D{}i | × |Q{}|

SIMcos( ~Di , ~Q) = 0 : D{}i and Q{} are disjoined

SIMcos( ~Di , ~Q) = 1 : D{}i and Q{} are equal

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 46 / 81

Page 47: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Other matching functions

Dice coefficient

SIMdice( ~Di , ~Q) =

2N∑

k=1

wi ,k × wQ,k

N∑k=1

wi ,k + wQ,k

Discrete Dice coefficient

SIMdice( ~Di , ~Q) =2|D{}i ∩ Q{}||D{}i |+ |Q{}|

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 47 / 81

Page 48: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 48 / 81

Page 49: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Matching cosine simplification

Question

Transform the cosine formula so that to simplify the matchingformula.

Cosine

SIMcos( ~Di , ~Q) =

n∑k=1

wi ,k × wQ,k√√√√√√n∑

k=1

(wi ,k)2 ×n∑

k=1

(wQ,k)2

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 49 / 81

Page 50: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Exercice: Similarity et dissimilarity and distance

The Euclidian distance between point x and y is expressed by :

L2(x , y) =∑i

(xi − yi )2

Question

Show the potential link between L2 and the cosine.

Tips:

consider points as vectors

consider document vectors has normalized, i.e. with∑

i y2i as

constant.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 50 / 81

Page 51: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Exercice: dot product and list intersection

The index is usually very sparse: a (usefull) term appears in lessthan 1% of document.A term t has a non null weight in ~D iff t appears in document D

Question

Show that the algorithm for computing the dot product isequivalent to list intersections.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 51 / 81

Page 52: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Exercice: efficiency of list intersection

Let the two lists X and Y of size |X | and |Y |:

Question

Compare the efficiency of a non sorted list intersection algorithmand a sorted list intesection.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 52 / 81

Page 53: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 53 / 81

Page 54: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Link between dot product and inverted files

If RSV (x , y) ∝∑

i xiyi after some normalizations of y , then :

RSV (x , y) ∝∑

i ,xi 6=0,yi 6=0

xiyi

Query: just to proceed with terms that are in the query, i.e.whose weight are not null.

Documents: store only non null terms.

Inverted file: access to non null document weight for for eachterm id.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 54 / 81

Page 55: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

The index

Use a Inverted file like Boolean model

Store term frequency in the posting list

Do not pre-compute the weight, but keep raw integer valuesin the index

Compute the matching value on the fly, i.e. during posting listintersection

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 55 / 81

Page 56: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Matching: matrix product

With an inverted file, in practice, matching computation is amatrix product:

a ! b

a b c d !

0 1 0 0 1 1 0 1 !! 1 1 1 1 0 1 0 0 !! 0 0 0 1 0 0 0 0 !!

d1 d2 d3 !.

1 1 1 1 1 1 0 1 !.. 1 1 0 0 !

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 56 / 81

Page 57: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 57 / 81

Page 58: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Relevance feedback

To learn system relevance from user relevance:

First query

Satistaction

Answer presentation

End Yes no

Matching

User eval

Start

reformulation

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 58 / 81

Page 59: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

WeightingExercicesImplementing VSMLeaning from user

Rocchio Formula

Rocchio Formula

~Qi+1 = α ~Qi + β ~Rel i − γ ~nRel i

With:~Rel i : the cluster center of relevant documents, i.e., the

positive feedback~nRel i : the cluster center of non relevant documents, i.e., the

negative feedback

Note : when using this formula, the generated query vector maycontain negative values.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 59 / 81

Page 60: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 60 / 81

Page 61: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Probabilistic Models

Capture the IR problem in a probabilistic framework.

first probabilistic model (Binary Independent Retrieval Model)by Robertson and Spark-Jones in 1976. . .

late 90s, emergence of language models, still hot topic in IR

Overall question: ”what is the probability for a document tobe relevant to a query ?” several interpretation of thissentence

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 61 / 81

Page 62: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Models

Classical Models

Boolean Vector Space Probabilistic

[Robertson & Spärk-Jones] [Tutle & Croft]

[Croft], [Hiemstra][Nie]

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 62 / 81

Page 63: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Probabilistic Model of IR

Different approaches of seeing a probabilistic approach forinformation retrieval

Classical approach: probability to have the event Relevantknowing one document and one query.

Inference Networks approach: probability that the query istrue after inference from the content of a document.

Language Models approach: probability that a query isgenerated from a document.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 63 / 81

Page 64: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Binary Independent Retrieval Model

Robertson & Spark-Jones 1976

computes the relevance of a document from the relevanceknown a priori from other documents.

achieved by estimating of each indexing term a the document,and by using the Bayes Theorem and a decision rule.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 64 / 81

Page 65: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Binary Independent Retrieval Model

R: binary random variable

R = r : relevant; R = r̄ : non relevant

P(R = r |d , q): probability that R is relavant for thedocument and the query considered (noted P(r |d , q) )

Probability of relevance depends only from document and query

term weights are binary: d = (11. . . 100. . . ),wdt = 0 or 1

Each term t is characterized by a binary variable wt , indicating theprobability that the term occurs.

P(wt = 1|q, r): probability that t occurs in a relevantdocument.

P(wt = 0|q, r) = 1–P(wt = 1|q, r)

The terms are conditionnaly independant to R

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 65 / 81

Page 66: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Binary Independent Retrieval Model

Non Relevant DocumentsR=r

CORPUS

Relevant DocumentsR=r

withCorpus = rel ∪ nonrelRelevant Documents ∩ Non Relevant Documents = ∅

For a query Q

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 66 / 81

Page 67: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Binary Independent Retrieval Model

Matching function : use of Bayes theorem

Probability that the document d belongs to the set of relevant documents of the query q.

Probability to obtain the description d from observed relevances

probability that the document d is picked for q

P(r,q) is the relevance probability. It is the chance of randomly taking one document from the corpus which is relevant for the query q

),(),().,(

),(qdP

qrPqrdPqdrP =

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 67 / 81

Page 68: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

BIR: Matching function

Decision: document retrieved if:

P(r |d , q)

P(r̄ |d , q)=

P(d |r , q).P(r , q)

P(d |r̄ , q).P(r̄ , q)> 1

IR looks for a ranking, so we eliminate P(r ,q)P(r̄ ,q) because it is constant

for a given query

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 68 / 81

Page 69: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

BIR: Matching function

In IR, it is more simple to use logs:

rsv(d) =rank log(P(d |r , q)

P(d |r̄ , q))

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 69 / 81

Page 70: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

BIR: Matching function

Simplification using hypothesis of independence between terms(Binary Independence) with weight w for term t in d

P(d |r , q) = P(d = (10...110...)|r , q)

=∏wdt =1

P(wdt = 1|r , q)

∏wdt =0

P(wdt = 0|r , q)

P(d |r̄ , q) = P(d = (10...110...)|r̄ , q)

=∏wdt =1

P(wdt = 1|r̄ , q)

∏wdt =0

P(wdt = 0|r̄ , q)

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 70 / 81

Page 71: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

BIR: Matching function

Let’s simplify the notations:

P(wdt = 1|r , q) = pt

P(wdt = 0|r , q) = 1− pt

P(wdt = 1|r̄ , q) = ut

P(wdt = 0|r̄ , q) = 1− ut

Hence :

P(d |r , q) =∏wdt =1

pt∏wdt =0

1− pt

P(d |r̄ , q) =∏wdt =1

ut∏wdt =0

1− ut

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 71 / 81

Page 72: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

BIR: Matching function

Now let’s compute the Relevance Status Value:

rsv(d) =rank log(P(d |r , q)

P(d |r̄ , q)) = log(

∏wdt =1 pt

∏wdt =0 1− pt∏

wdt =1 ut

∏wdt =0 1− ut

)

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 72 / 81

Page 73: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Conclusion

Common step for indexing:

Define index set

Automatize index construction from documents

Select a model for index representation and weighting

Define a matching process and an associated ranking

This is used for textual processing but also for other media.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 73 / 81

Page 74: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Outline

1 Basic IR ModelsSet modelsBoolean ModelBeyond boolean logicWeighted Boolean ModelExercices

2 Vector Space ModelWeightingExercicesImplementing VSMLeaning from user

3 Probabilistic models

4 Solutions

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 74 / 81

Page 75: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: matching cosine simplification

Because Q is constant, norm of vector Q is a constant that can beremoved.

K =1

n∑k=1

(wQ,k)2

SIMcos( ~Di , ~Q) = K

n∑k=1

wi ,k × wQ,k√√√√ n∑k=1

(wi ,k)2

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 75 / 81

Page 76: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: matching cosine simplification

The division of the document vector by its norm can beprecomputed:

w ′i ,k =wi ,k√√√√ n∑

k=1

(wi ,k)2

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 76 / 81

Page 77: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: matching cosine simplification

Hence, one juste has to compute a vector dot product:

SIMcos( ~Di , ~Q) = Kn∑

k=1

w ′i ,k × wQ,k

The constant does not influence the order:

RSVcos( ~Di , ~Q) ∝n∑

k=1

w ′i ,k × wQ,k

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 77 / 81

Page 78: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: Similarity et dissimilarity and distance

To transform a distance or dissimilarity into a similarity, simplynegate the value. Ex. with the Squared Euclidian distance :

RSVD2(x , y) = −D2(x , y)2 = −∑i

(xi − yi )2

= −∑i

(x2i + y2

i − 2xiyi ) = −∑i

x2i −

∑i

y2i + 2

∑i

xiyi

if x is the query then∑

i x2i is constant for matching with a

document set.

RSV (x , y)D2 ∝ −∑i

y2i + 2

∑i

xiyi

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 78 / 81

Page 79: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: Similarity et dissimilarity and distance

RSV (x , y)D2 ∝ −∑i

y2i + 2

∑i

xiyi

If∑

i y2i is constant over the corpus then :

RSV (x , y)D2 ∝ 2∑i

xiyi

RSV (x , y)D2 ∝∑i

xiyi

Hence, if we normalize the corpus so each document vector lengthis a constant, then using the Euclidean distance as a similarity,provides the same results than the cosine of the vector spacemodel !

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 79 / 81

Page 80: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: dot product and list intersection

In the dot product∑

i xiyi , if the term at rank i is not in x or ythen the term xiyi is null and does not participate to the value.Hence,If we compute directly

∑i xiyi using a list of terms for each

document, then we sum the xiyi only for terms that are in theintersection.Hence, it is equivalent to an intersection list.

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 80 / 81

Page 81: Chap 2: Classical models for information retrieval - imag.frmrim.imag.fr/GINF533U/lib/exe/fetch.php?media=m2r... · Basic IR Models Vector Space Model Probabilistic models Solutions

Basic IR ModelsVector Space ModelProbabilistic models

Solutions

Solution: efficiency of list intersection

If X and Y are not sorted, for each xi one must find the term in Yby sequential search. So complexity is O(|X | × |Y |)If X and Y are sorted, the algorithm used two pointer:

Before the two pointer, lists are merged

Lets compare the pointers :

If terms are equal : we can compute the merge (or theproduct), then move the tow pointersIt terms are different : because of the order, one are sure thesmaller term to net be present in the other list. On ca movethat pointeur.

Hence one move at least one pointer on each list, the thecomplexity is : O(|X |+ |Y |)

Jean-Pierre Chevallet & Philippe Mulhem Models of IR 81 / 81