11
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Embed Size (px)

DESCRIPTION

Information Retrieval (IR) The field of information retrieval deals with the representation, storage, organization of, access to information items. Search Engines: Search a large collection of documents to find the ones that satisfy an information need, e,g, find relevant documents Indexing Document representation Comparison with query Evaluation/feedback

Citation preview

Page 1: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Xiaoying GaoComputer Science

Victoria University of Wellington

COMP307NLP 4

Information Retrieval

Page 2: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Menu• Information Retrieval (IR) basics• An example• Evaluation: Precision and Recall

Page 3: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Information Retrieval (IR)The field of information retrieval deals with the

representation, storage, organization of, access to information items.

Search Engines:

Search a large collection of documents to find the ones that satisfy an information need, e,g, find relevant documents

• Indexing• Document representation• Comparison with query• Evaluation/feedback

Page 4: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

IR basics• Indexing

• Manual indexing: using controlled vocabularies, e,g, libraries, early version of Yahoo

• Automatic indexing: indexing program assigns keywords, phrases or other features, e.g. words from text of document

• Popular Retrieval Models

• Boolean: exact match• Vector Space: best match

• Citation analysis models: best match

• Probabilistic models: best match

COMP 307 1:4

Page 5: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Vector space Model Any text object can be represented by a term vector.

Similarity is determined by distance in a vector space.

ExampleDoc1: 0.3, 0.1, 0.4Doc2: 0.8, 0.5, 0.6

Query: 0.0, 0.2,0.0

Vector Space Similarity: Cosine of the angle between the two vectors

n

ii

n

ii

n

i

ii

YX

YX

1

2

1

2

1

COMP 307 1:5

Page 6: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Text representationTerm weightsTerm weights reflect the estimated importance of each

term

The more often a word occurs in a document, the better that term is in describing what the document is about

But terms that appear in many documents in the collection are not very useful for distinguishing a relevant document from a non-relevant one

COMP 307 1:6

Page 7: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Term weights: TF.IDF• Term weight Xi = TF * IDF• TF: Term Frequency

• IDF: Inverse Document Frequency

ngthdocumentLetermCountTF

untdocumentCocumentsTotalNumDoIDF log

Page 8: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

An exampleDocument1: cat eat mouse, mouse eat cocolate

Document 2: cat eat mouse

Document 3: mouse eat chocolate mouse

Indexing terms:Vector representation for each document

Query: catVector representation of query

Which document is more relevant?

Page 9: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

example

Page 10: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Evaluation• Test collection

• TREC• Parameters

• Recall: Percentage of all relevant documents that are found by a search

• Precision: Percentage of retrieved documents that are relevant

ionsInCollectlevantItemoftrievedslevantItemofR

Re#ReRe#

trievedofItemtrievedslevantItemofP

Re#ReRe#

Page 11: Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Discussion• How to compare two IR systems

• Which is more important in Web search: precision or recall?

• What leads to Google’s success for IR