信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities [email protected] Spring 2020 1

Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector


信息检索与搜索引擎 Introduction to Information Retrieval

GESC1007

Philippe Fournier-Viger

Full professor

School of Natural Sciences and Humanities

[email protected]

Spring 2020

Last week

We discussed:

◦ Index construction
◦ Index compression

QQ Group: 1059666166

Website: Videos, PPTs…

Course schedule (日程安排)

Lecture 1: Introduction, Boolean retrieval (布尔检索模型)
Lecture 2: Term vocabulary and posting lists
Lecture 3: Dictionaries and tolerant retrieval
Lecture 4: Index construction and compression
Lecture 5: Scoring, weighting, and the vector space model
Lecture 6: Computing scores, and a complete search system
Lecture 7: Evaluation in information retrieval
Lecture 8: Web search engines, advanced topics, and conclusion

CHAPTER 6: SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL

(pdf p. 146)

Introduction

Until now, we have only discussed Boolean queries (布尔查询):

Shenzhen AND Beijing
car AND rental AND NOT buy

For each query, a document either matches the query or does not
(e.g. each document either satisfies “Shenzhen AND Beijing” or does not).
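To make the "matches or does not match" behaviour concrete, here is a minimal sketch of Boolean retrieval using Python sets. The three documents and the helper name `postings` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "flights from Shenzhen to Beijing",
    2: "car rental in Shenzhen",
    3: "buy a car in Beijing",
}

# Build a tiny inverted index: term -> set of ids of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term.lower(), set())

# Shenzhen AND Beijing
q1 = postings("Shenzhen") & postings("Beijing")

# car AND rental AND NOT buy
q2 = postings("car") & postings("rental") & (all_ids - postings("buy"))
```

Each result is just a set of matching documents: nothing says which match is better, which is the limitation the following slides address.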

Introduction

For small document collections, a Boolean query may retrieve few documents.

Query = HITSZ AND Basketball AND Football

Only three documents!

Introduction

For large document collections (e.g. the Web), the number of documents matching a query may be huge.

Query = Beijing AND Restaurants

5 million webpages!

Introduction

For large document collections, a user may not be able to look at all the results:
the user may not want, or may not have time, to look at 5 million webpages.

Thus, a search engine such as Baidu should not show all the documents to the user.

Introduction

A search engine should:

◦ calculate a score (得分) for each document, indicating its relevance for the query;
◦ present documents to the user in decreasing order of relevance, using this score.

In other words, the most relevant documents are presented first.

Example →

[Screenshot: search results for the query “Shenzhen airport”. Annotations: more than 2 million results; the most relevant document is listed first and the least relevant documents last; an advertisement (广告) is also shown.]

The structure of a document

Documents typically contain:

◦ Text
◦ Pictures, videos, and other information

A document may also contain data about the document itself (metadata, 元数据).

Example →

Example of metadata (元数据)

Field                  Value
Author:                Shakespeare
Title:                 Bla Bla Bla
Date of publication:   1900/01/01
Date of creation:      1899/01/01
Language:              English

Metadata may consist of values for various fields.

Parametric search

Definition: searching for documents based on their properties (metadata, 元数据).

[Figure: a search form]

e.g. searching for all English documents published in 1615 by Shakespeare.

[Figure: a search form filled in with the year 1615 and the author Shakespeare]

Another example

“Find documents authored by Shakespeare in 1601, containing the word ‘England’”

Author = “Shakespeare”
Publication year = 1601
Body contains “England”

How to find documents for this query?

Find documents containing “England” using a dictionary.

When searching using the dictionary, only retrieve the documents having the field values specified by the query (Author = Shakespeare, Year = 1601).
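The two steps just described (look up the term in the dictionary, then keep only documents with the requested field values) can be sketched as follows. The four documents, their texts and the function name `parametric_search` are assumptions made for this illustration.

```python
# Hypothetical collection: each document has metadata fields and a body.
docs = [
    {"id": 1, "author": "Shakespeare", "year": 1601, "body": "a story about England"},
    {"id": 2, "author": "Shakespeare", "year": 1599, "body": "England and France"},
    {"id": 3, "author": "Marlowe", "year": 1601, "body": "the fields of England"},
    {"id": 4, "author": "Shakespeare", "year": 1601, "body": "a story about Denmark"},
]

# Dictionary (inverted index) for the body: term -> set of document ids.
body_index = {}
for d in docs:
    for term in d["body"].lower().split():
        body_index.setdefault(term, set()).add(d["id"])

by_id = {d["id"]: d for d in docs}

def parametric_search(term, **field_values):
    """Look the term up in the dictionary, then keep only the documents
    whose fields have exactly the values specified by the query."""
    candidates = body_index.get(term.lower(), set())
    return sorted(
        doc_id for doc_id in candidates
        if all(by_id[doc_id][f] == v for f, v in field_values.items())
    )

result = parametric_search("England", author="Shakespeare", year=1601)
```

Only document 1 satisfies all three conditions; documents 2 and 3 contain “England” but fail one field test each.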

Search forms

Search forms have two types of components:

Field: takes a value from a limited set of values
Zone: the user can write any text

[Figure: a search form with the entries “Publication date”, “Title”, “Keywords” and “Abstract”. “Publication date” is a field; the others are zones, e.g. containing the free text “Travel to Beijing”.]

Indexes for fields and zones

It is easy to create an index for a field (field index) because the set of terms to use for searching is restricted and small,
e.g. 2010, 2011, 2012…

It is more difficult to create a zone index because all possible terms must be considered, and there can be many!

How to create a “zone index”?

Zone index (1st approach)

A zone index is a dictionary where each term is specific to its zone.

[Figure: a dictionary in which each term is combined with a zone name, each entry having its own posting list]

This indicates that the value “william” appears in the zone “author” for documents 2, 3, 5 and 8.

Problem: this would work, but there might be too many terms in the dictionary!

Zone index (2nd approach)

Another way of representing a zone index is the following:

[Figure: a dictionary with a single entry per term, where each posting also records the zones]

This indicates that the value “william” appears in the zone “author” and in the “title” of document 2.

Is it better?

• The number of terms in the dictionary is reduced…
• But the posting lists may be longer
• Useful to search for terms appearing anywhere in the document (title, author etc.)
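The two representations of a zone index can be compared with a small sketch. The "author" postings (2, 3, 5, 8) come from the text above; the "title" postings and the `term.zone` key format are assumptions made for this example.

```python
# 1st approach: one dictionary entry per (term, zone) pair.
# The "author" postings are the ones given in the text; the "title"
# postings are an assumption made for this sketch.
zone_index_v1 = {
    "william.author": [2, 3, 5, 8],
    "william.title": [2, 4],
}

def to_second_approach(index_v1):
    """Convert to the 2nd approach: one entry per term, and each posting
    records the zones in which the term occurs in that document."""
    index_v2 = {}
    for key, doc_ids in index_v1.items():
        term, zone = key.rsplit(".", 1)
        postings = index_v2.setdefault(term, {})
        for doc_id in doc_ids:
            postings.setdefault(doc_id, []).append(zone)
    # Sort the postings of each term by document id.
    return {term: dict(sorted(p.items())) for term, p in index_v2.items()}

zone_index_v2 = to_second_approach(zone_index_v1)
```

The second dictionary has one key (“william”) instead of two, while its single posting list covers every document in which the term occurs in any zone.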

Ranking documents using a zone index

A zone index is useful for ranking documents from most relevant to least relevant.

This technique is called “weighted zone scoring”.

A score between 0 and 1 will be assigned to each document.

1 = highly relevant document
0 = irrelevant document

Note: “weighted zone scoring” is also called “ranked Boolean retrieval”.

How to assign scores?

The score of a document:

The overall score of a document is the sum of the scores of its zones.

Each zone of a document contributes to the document’s score if the query matches the zone.

Weights may be assigned to each zone to indicate their relative importance
(e.g. the title may be more important than the author name).

Example

A collection of documents with three zones: author, title and body.

[Figure: a document with three zones labeled “author”, “title” and “body”]

Example

We can set the weights as follows:

weight of “author” = 0.2
weight of “title” = 0.3
weight of “body” = 0.5

Thus, a match in the author zone changes the overall score only a little, a match in the title zone somewhat more, and a match in the body zone the most.

Example

We will search for documents using this query:

QUERY = Shakespeare

Example

weight of “author” = 0.2
weight of “title” = 0.3
weight of “body” = 0.5

QUERY = Shakespeare

If a document D1 contains “Shakespeare” in the title and body, its overall score is:
0 + 0.3 + 0.5 = 0.8

If a document D2 contains “Shakespeare” only in the body, its overall score is:
0 + 0 + 0.5 = 0.5

Thus, document D1 is ranked higher than document D2.
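The calculation above can be reproduced with a short sketch. The weights are those of the example; the two documents and the function name `weighted_zone_score` are hypothetical, written so that D1 matches in the title and body and D2 only in the body.

```python
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}

def weighted_zone_score(query, document, weights=ZONE_WEIGHTS):
    """Sum the weights of the zones whose text contains the query term."""
    q = query.lower()
    return sum(
        w for zone, w in weights.items()
        if q in document.get(zone, "").lower().split()
    )

# Hypothetical documents matching the two cases described above.
d1 = {"author": "Anonymous", "title": "Shakespeare and his time",
      "body": "Shakespeare wrote plays"}
d2 = {"author": "Anonymous", "title": "English theatre",
      "body": "Plays by Shakespeare"}

score_d1 = weighted_zone_score("Shakespeare", d1)  # title + body zones match
score_d2 = weighted_zone_score("Shakespeare", d2)  # only the body zone matches
```

Because the three weights sum to 1, a document matching in every zone would get the maximum score of 1.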

How to calculate scores using a zone index?

e.g. query: “william”

Using the index, we can quickly find all the documents containing “william”.

The information in the posting list(s) allows calculating the score of each document:

Doc 2: 0.2 + 0.3 = 0.5
Doc 3: 0.2
Doc 4: 0.3
Doc 5: 0.2
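A minimal sketch of this calculation, assuming the zone index is stored in the second representation (one entry per term, with zones recorded in each posting). The postings below are inferred from the scores listed above, so they are an assumption rather than data taken from the original figure.

```python
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}

# Zone index in the 2nd representation: term -> {doc id: zones containing it}.
# The postings are inferred from the scores listed above (an assumption).
zone_index = {
    "william": {2: ["author", "title"], 3: ["author"], 4: ["title"], 5: ["author"]},
}

def score_from_zone_index(term, index, weights=ZONE_WEIGHTS):
    """Walk the posting list of the term once and add up the zone weights."""
    scores = {}
    for doc_id, zones in index.get(term, {}).items():
        scores[doc_id] = sum(weights[z] for z in zones)
    return scores

scores = score_from_zone_index("william", zone_index)
# Document ids ordered from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

A single pass over the posting list is enough: documents that do not contain the term are never visited.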

How to choose the weights?

Different weights will result in different rankings of documents.

Weights could be set by a human expert.

weight of “author” = 0.2, weight of “title” = 0.3, weight of “body” = 0.5
or
weight of “author” = 0.4, weight of “title” = 0.4, weight of “body” = 0.2

Which one is better?

How to choose the weights?

But often, the weights are “learned” automatically using “machine learning methods” (机器学习方法).

The idea is to try to find the best weights automatically.

How does it work?

We have a set of training examples:
a query + some documents + a relevance judgment on each document.

Relevance judgment: {relevant, non_relevant} or a score.

[Table of training examples. One column indicates whether the query matches the title of the document (0 = no, 1 = yes); another indicates whether the query matches the body of the document (0 = no, 1 = yes).]

How does it work?

The weights are then “learned” from these examples so that they approximate the relevance judgments in the examples.

Let’s say that: relevant = 1, non_relevant = 0

What should the weights be?

weight of body = ? 0.5 ? 0.6 ? 0.3 ?
weight of title = ? 0.5 ? 0.4 ? 0.7 ?

It is an optimization problem (优化问题). We should find the weights that provide the best approximation (i.e. that minimize the error).
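As a rough illustration of this optimization problem, the sketch below tries many candidate weights and keeps the one with the smallest squared error. Everything specific here is an assumption: the seven training examples are made up, only two zones (title and body) are used, the two weights are forced to sum to 1, and a simple grid search stands in for a real machine learning method.

```python
# Hypothetical training examples: (title_match, body_match, relevance).
# Matches are 0 = no, 1 = yes; relevance is 1 = relevant, 0 = non_relevant.
examples = [
    (1, 1, 1),
    (0, 1, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 0, 0),
    (0, 0, 0),
    (0, 1, 0),
]

def total_error(w_title, data):
    """Sum of squared differences between the score and the judgment,
    with weight of body = 1 - weight of title."""
    w_body = 1.0 - w_title
    return sum((w_title * t + w_body * b - r) ** 2 for t, b, r in data)

# Try weights 0.00, 0.01, ..., 1.00 and keep the one with the smallest error.
candidates = [i / 100 for i in range(101)]
best_w_title = min(candidates, key=lambda w: total_error(w, examples))
best_w_body = 1.0 - best_w_title
```

For these particular examples the error is 3(1 − w)² + 2w² for a title weight w, which is smallest at w = 0.6, so the search returns a title weight of 0.6 and a body weight of 0.4. Different training examples would give different weights.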

TERM FREQUENCY AND WEIGHTING

Introduction

Until now, we have considered only whether a term appears or does not appear in a document.

We did not consider how often a term appears in a document!

“shenzhen” appears 3 times in webpage 1
“shenzhen” appears 30 times in webpage 2

We need to consider term frequency (how many times a term appears)!

Free text queries

Users of a search engine like Baidu will often search using a list of keywords (terms) as the query,
e.g. shenzhen airport schedule

This query is said to contain three terms or keywords.

This type of query is called a “free text query” (or “free form query”).

No Boolean operators are used.

Scoring documents for free text queries

To answer free text queries, the score of a document should be based on:

◦ which keywords appear in the document,
◦ how many times they appear in the document.

Scoring documents for free text queries

Simple approach: the score of a document is the sum of the number of occurrences of each query term in that document (term frequency, TF, 词频).

QUERY: shenzhen airport

DOCUMENT: Shenzhen is a big city, founded in 1979. Shenzhen has an airport. There are also a few ports in Shenzhen.

SCORE = 3 + 1 = 4
(3 times “shenzhen” + 1 time “airport”)
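The simple term-frequency score can be computed with a few lines of code. The tokenizer below (lowercasing and keeping only letters and digits) is an assumption; the slides do not say how the text is split into terms.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_score(query, document):
    """Sum, over the query terms, of how often each occurs in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in tokenize(query))

document = ("Shenzhen is a big city, founded in 1979. Shenzhen has an airport. "
            "There are also a few ports in Shenzhen.")
score = tf_score("shenzhen airport", document)
```

The word “ports” is a different term from “airport”, so it does not contribute: the score is 3 + 1 = 4, as in the example.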

Bag of words model (词袋模型)

This approach for scoring documents is called the “bag of words model” because it does not consider the order of the words.
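A small sketch of what "ignoring word order" means in practice. The two sentences are hypothetical examples chosen because they contain the same words in a different order.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to its number of occurrences; word order is discarded."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order.
a = bag_of_words("Mary is quicker than John")
b = bag_of_words("John is quicker than Mary")
same_bag = (a == b)
```

The two sentences mean different things, yet their bags of words are identical, so any score based only on term counts treats them the same.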

Are all words equally important?

No!

Example: airports in Shenzhen

The word “in” is a very common word that is in general not important (it is a stop word, 停用词). We may ignore it.

Problem with term frequency

All terms are viewed as equally important by the previous approach.

Words that are very frequent do not help to discriminate between documents,
e.g. documents related to the automobile industry will likely all contain the word “auto”.

Solution

Idea: we want to assign smaller weights to terms that are very frequent.

Inverse document frequency (IDF) of a term (逆文档频率):

$$\mathrm{IDF}_t = \log\left(\frac{N}{\mathrm{DF}_t}\right)$$

$N$ = number of documents in the collection
$\mathrm{DF}_t$ = number of documents containing the term (document frequency)
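The formula can be tried out on a few values. The collection size and document frequencies below are hypothetical, and the base-10 logarithm is an assumption, since the formula only writes "log".

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, using a base-10 logarithm (an assumption:
    the formula above only writes "log")."""
    return math.log10(n_docs / doc_freq)

N = 1_000_000               # hypothetical collection size
rare = idf(N, 100)          # term found in 100 documents
common = idf(N, 100_000)    # term found in 100,000 documents
everywhere = idf(N, N)      # term found in every document
```

With these values the rare term gets an IDF of 4, the common term an IDF of 1, and a term that appears in every document an IDF of 0: the more documents contain a term, the smaller its weight.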

Example

N = 806,791 documents

$$\mathrm{IDF}_t = \log\left(\frac{N}{\mathrm{DF}_t}\right)$$


How to combine document frequency and term frequency?

To score each term in a document, a popular measure is TF-IDF.

TF = term frequency
IDF = inverse document frequency

TF-IDF

$$\text{TF-IDF}_t = \mathrm{TF}_t \times \mathrm{IDF}_t$$

$\mathrm{TF}_t$ = term frequency (词频): number of times that the term appears in the document
$\mathrm{IDF}_t$ = inverse document frequency (逆文档频率) of the term

TF-IDF

For a document, a term will get:

◦ a high score if it appears many times in the document but rarely in other documents;
◦ a lower score if it appears few times in the document, or if it appears in many documents;
◦ a low score if it appears in almost all documents.

To score a document for a query

Based on TF-IDF, we can calculate the score of a document for a query:

Score(query, document) = sum of the TF-IDF of all terms in the query for that document.

Consider the query: Shenzhen Beijing

Score(Shenzhen Beijing, Doc1) = TF-IDF_Shenzhen + TF-IDF_Beijing
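Here is a minimal sketch of this scoring rule on a tiny collection. The three documents are hypothetical, the logarithm is taken in base 10 (an assumption), and terms absent from the collection are given a score of 0 to avoid dividing by zero.

```python
import math
from collections import Counter

# Hypothetical collection of three short documents.
docs = {
    "Doc1": "shenzhen shenzhen beijing train",
    "Doc2": "beijing beijing restaurants",
    "Doc3": "beijing weather",
}
tf = {name: Counter(text.split()) for name, text in docs.items()}
N = len(docs)

def df(term):
    """Number of documents containing the term."""
    return sum(1 for counts in tf.values() if term in counts)

def tf_idf(term, doc):
    """TF-IDF of a term in a document; 0 if the term is not in the collection."""
    if df(term) == 0:
        return 0.0
    return tf[doc][term] * math.log10(N / df(term))

def score(query, doc):
    """Sum of the TF-IDF of all query terms for that document."""
    return sum(tf_idf(term, doc) for term in query.lower().split())

scores = {doc: score("Shenzhen Beijing", doc) for doc in docs}
```

In this toy collection “beijing” appears in every document, so its IDF is 0 and it contributes nothing; only Doc1, which contains “shenzhen”, gets a score above 0.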

THE VECTOR SPACE MODEL FOR SCORING

Vector model (矢量模型)

The vector space model is a way of representing documents that is often used in information retrieval.

Each document is represented as a vector (矢量).

Vector model (矢量模型)

A vector (矢量) representing a document contains a score for each word of the dictionary (using the TF-IDF measure or another measure).

Suppose that we have only two words in the dictionary: Shenzhen and Beijing.

vector(doc1) = [0.7, 0.1]

Here doc1 is a document, 0.7 is the score of Shenzhen in that document, and 0.1 is the score of Beijing in that document.

The vector of a document provides information about:

• the words contained in the document
• how often they appear in the document

But it does not keep information about the order of the words.

Vector model (矢量模型)

The collection of all documents can be viewed as a set of vectors.

vector(doc1) = [0.4, 0.5]
vector(doc2) = [0.7, 0.1]
vector(doc3) = [0.4, 0.8]
vector(doc4) = [0.7, 0.9]
…

Vector model (矢量模型)

The vectors of documents can be viewed as a chart:

[Chart: the document vectors plotted in two dimensions, with both axes ranging from 0 to 1]

Each document is a point or vector (an arrow on the chart).

In our case, we have two dimensions (2D) because we assume that the two words “Shenzhen” and “Beijing” are the only words in the dictionary.

If we created the vectors using 3 words, we would have 3 dimensions (3D) instead, and so on.

Similarity between documents

Using the vector space model, we can try to calculate how similar two documents are.

Intuitively, two documents are expected to be similar if their vectors are close to each other.

Similarity between documents

However, longer documents will tend to have larger values in their vectors (when calculating scores using TF-IDF).

[Chart: the vector [0.7, 0.9] would be a longer document; the vector [0.4, 0.5] would be a shorter document.]

Similarity between documents

How to calculate the similarity between two vectors while considering the length of documents?

Solution: calculate the cosine similarity (余弦相似度) of two documents d1 and d2.

Example →

Cosine similarity (余弦相似度)

vector(doc1) = [a1, a2, … an]
vector(doc2) = [b1, b2, … bn]

$$\mathrm{cossimilarity}(doc1, doc2) = \frac{a_1 \times b_1 + a_2 \times b_2 + \dots + a_n \times b_n}{\sqrt{a_1^2 + a_2^2 + \dots + a_n^2} \times \sqrt{b_1^2 + b_2^2 + \dots + b_n^2}}$$

The first square root is the length of the vector (arrow) of document doc1; the second is the length of the vector (arrow) of document doc2.

If we use the standard math notation:

$\vec{doc1}$ = [a1, a2, … an]
$\vec{doc2}$ = [b1, b2, … bn]

$$\mathrm{cossimilarity}(\vec{doc1}, \vec{doc2}) = \frac{\vec{doc1} \cdot \vec{doc2}}{|\vec{doc1}| \, |\vec{doc2}|}$$

The numerator is the dot product of the two vectors; the denominator is the product of the lengths of the two vectors.
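The formula translates directly into code. The sketch below applies it to three of the two-dimensional example vectors used earlier; the variable names and the choice of returning 0 for an all-zero vector are assumptions made here.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (length_a * length_b)

short_doc = [0.4, 0.5]
long_doc = [0.7, 0.9]
other_doc = [0.7, 0.1]

sim_short_long = cosine_similarity(short_doc, long_doc)
sim_short_other = cosine_similarity(short_doc, other_doc)
```

The vectors [0.4, 0.5] and [0.7, 0.9] point in almost the same direction, so their similarity is very close to 1 even though one is much longer; [0.7, 0.1] points elsewhere and gets a clearly lower similarity.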

Example – 3 documents

We will use the term frequency (TF) measure instead of TF-IDF for the vectors.

We will consider three documents with the following term frequencies:

[Table: term frequencies of the three documents]

The lengths of the three vectors are:

$|\vec{doc1}|$ = 30.56, $|\vec{doc2}|$ = 46.86, $|\vec{doc3}|$ = 41.30

Example – 3 documents

• Then, we length-normalize each vector.
• This means dividing each value in a vector $\vec{v}$ by the length of the vector $|\vec{v}|$.

[Table: length-normalized values for the three documents]

With these values, we could easily calculate the cosine similarity between any two documents.
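Length normalization can be sketched in a few lines. The vectors [3, 4] and [30, 40] are hypothetical term-frequency vectors, picked because their lengths (5 and 50) are easy to check by hand.

```python
import math

def length_normalize(v):
    """Divide each value of v by the Euclidean length of v."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        return list(v)  # an all-zero vector is returned unchanged
    return [x / length for x in v]

# Hypothetical term-frequency vectors.
u = length_normalize([3, 4])    # length 5
w = length_normalize([30, 40])  # same direction, ten times longer

# After normalization, the cosine similarity is just the dot product.
cosine = sum(x * y for x, y in zip(u, w))
```

Both vectors normalize to [0.6, 0.8], so their dot product is 1: once vectors are length-normalized, a document that is ten times longer is no longer favoured.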

Example – 3 documents

We will not do all the calculations!

What is important is that we can now calculate the similarity between two vectors (documents).

Queries as vectors

Just like documents, a query can be represented as a vector.

Example: assume a dictionary of three words: affection, jealous, gossip

The query: jealous gossip

can be viewed as a vector:

vector(query) = [0, 0.7, 0.7]
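The slide does not say how the values 0.7 are obtained. One reading that reproduces them, offered here as an assumption, is to count each dictionary word in the query and then length-normalize the result, since 1/√2 ≈ 0.707. The function name `query_vector` is hypothetical.

```python
import math

dictionary = ["affection", "jealous", "gossip"]

def query_vector(query, vocabulary):
    """Count each dictionary word in the query, then length-normalize."""
    words = query.lower().split()
    counts = [words.count(term) for term in vocabulary]
    length = math.sqrt(sum(c * c for c in counts))
    if length == 0:
        return [0.0] * len(vocabulary)
    return [c / length for c in counts]

vec = query_vector("jealous gossip", dictionary)
rounded = [round(x, 1) for x in vec]
```

Rounded to one decimal, the result is [0.0, 0.7, 0.7], the same vector as on the slide.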

Queries as vectors

The score of a document for a query can be defined as the cosine similarity between the query and the document.

score(query, document) = cossimilarity(query, document)

Queries as vectors

Using this approach, a document may have a high score for a query even if it does not contain all the query terms.

Note: in practice, we can use TF, TF-IDF or other measures for calculating the scores of terms in documents.

Queries as vectors (cont’d)

To search for documents:

◦ Calculate the cosine similarity between the query and each document.
◦ Show the user only the documents that have the highest similarity with the query (e.g. the top 10 documents).
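The two steps above can be sketched as a brute-force search. The document vectors are the two-dimensional examples used earlier (Shenzhen score, Beijing score); the query vector [1.0, 0.0] and the function name `top_k` are assumptions made for this illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=10):
    """Score every document against the query and keep the k best."""
    scored = ((cosine_similarity(query_vec, v), name) for name, v in doc_vecs.items())
    return [name for _, name in heapq.nlargest(k, scored)]

# Document vectors from the earlier example; the query vector is hypothetical.
doc_vecs = {
    "doc1": [0.4, 0.5],
    "doc2": [0.7, 0.1],
    "doc3": [0.4, 0.8],
    "doc4": [0.7, 0.9],
}
query_vec = [1.0, 0.0]  # a query about "Shenzhen" only
best = top_k(query_vec, doc_vecs, k=2)
```

For this query the two best documents are doc2 and then doc1. Note that doc4 has a higher Shenzhen score than doc1 (0.7 against 0.4) but ranks lower, because most of its weight is on Beijing.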

Queries as vectors (cont’d)

However, calculating the similarity between a query and all documents is time-consuming!

We want a search engine to be fast!

Computing vector scores

1) Set the score of all documents to 0.

2) For each query term t:
◦ calculate the weight of the query term t.
◦ obtain the posting list of t.
◦ increase the score of each document appearing in the posting list of t.

3) Divide the score of each document by its vector length.

4) Show the documents having the highest score to the user.
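The four steps above can be sketched as follows. The data layout is an assumption: each posting stores a document id together with the weight of the term in that document, and document vector lengths are precomputed. The tiny index, the weights and the function name `cosine_scores` are hypothetical.

```python
import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, k=10):
    """query_weights: term -> weight of the term in the query.
    postings: term -> list of (doc id, weight of the term in that document).
    doc_lengths: doc id -> length of the document vector."""
    scores = defaultdict(float)                      # step 1: all scores start at 0
    for term, w_query in query_weights.items():      # step 2: for each query term
        for doc_id, w_doc in postings.get(term, []):
            scores[doc_id] += w_query * w_doc
    for doc_id in scores:                            # step 3: normalize by length
        scores[doc_id] /= doc_lengths[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])  # step 4

# Hypothetical index with raw term frequencies as document weights.
postings = {
    "shenzhen": [(1, 3.0), (2, 1.0)],
    "airport": [(1, 4.0), (3, 2.0)],
}
doc_lengths = {1: 5.0, 2: 2.0, 3: 2.5}
top = cosine_scores({"shenzhen": 1.0, "airport": 1.0}, postings, doc_lengths, k=2)
```

Only documents that appear in the posting list of at least one query term are ever touched, which is why this is faster than comparing the query with every document. The scores are not divided by the length of the query vector, since that value is the same for all documents and does not change the ranking.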

Computing vector scores

A key observation is that:

If we have multiple computers, we can calculate the scores using several computers working in parallel.

Each computer may score different documents for the same query.
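As a rough single-machine illustration of this idea, the sketch below splits a collection into shards, scores each shard in its own worker, and merges the partial results. Threads stand in for separate computers, the scoring function is a plain term-frequency sum, and the documents are made up; all of these are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def score_shard(shard, query_terms):
    """Score the documents of one shard: here, a simple term-frequency sum."""
    results = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        results.append((sum(words.count(t) for t in query_terms), doc_id))
    return results

def parallel_top_k(shards, query_terms, k=10):
    """Score each shard in its own worker, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda s: score_shard(s, query_terms), shards)
        merged = [pair for part in partial for pair in part]
    return [doc_id for _, doc_id in heapq.nlargest(k, merged)]

# Hypothetical collection split into two shards (one per "computer").
shards = [
    {1: "shenzhen airport schedule", 2: "beijing restaurants"},
    {3: "shenzhen airport shenzhen metro", 4: "airport parking"},
]
best = parallel_top_k(shards, ["shenzhen", "airport"], k=2)
```

Each shard can be scored without looking at the others, because the score of a document depends only on that document and the query; only the final merge needs results from every worker.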

SUMMARY

Term frequency (TF) (词频): how many times a term appears in a document.

Document frequency (DF) (文档频率): how many documents contain a term in a collection of documents.

Inverse document frequency (IDF) of a term t (逆文档频率):

$$\mathrm{IDF}_t = \log\left(\frac{N}{\mathrm{DF}_t}\right)$$

$N$ = number of documents in the collection
$\mathrm{DF}_t$ = document frequency of the term t

Conclusion

Today, we discussed chapter 6: scoring, term weighting and the vector space model.

We will continue next week…

The PPT slides are on the website.

References

Manning, C. D., Raghavan, P., Schütze, H. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.