Xiaoying Gao, Computer Science
Victoria University of Wellington

COMP307 NLP 4
Information Retrieval
Menu
• Information Retrieval (IR) basics
• An example
• Evaluation: Precision and Recall
Information Retrieval (IR)
The field of information retrieval deals with the representation, storage, organization of, and access to information items.

Search Engines:
Search a large collection of documents to find the ones that satisfy an information need, e.g. find relevant documents
• Indexing
• Document representation
• Comparison with query
• Evaluation/feedback
IR basics
• Indexing
  • Manual indexing: using controlled vocabularies, e.g. libraries, early versions of Yahoo
  • Automatic indexing: an indexing program assigns keywords, phrases or other features, e.g. words from the text of the document
• Popular Retrieval Models
  • Boolean: exact match
  • Vector Space: best match
  • Citation analysis models: best match
  • Probabilistic models: best match
Vector Space Model
Any text object can be represented by a term vector.
Similarity is determined by distance in a vector space.

Example
Doc1: 0.3, 0.1, 0.4
Doc2: 0.8, 0.5, 0.6
Query: 0.0, 0.2, 0.0
Vector Space Similarity: Cosine of the angle between the two vectors
sim(X, Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}}
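Applying cosine similarity to the three example vectors above gives the following sketch (the function name `cosine_similarity` is my own; this is a minimal illustration, not a full IR system):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

doc1 = [0.3, 0.1, 0.4]
doc2 = [0.8, 0.5, 0.6]
query = [0.0, 0.2, 0.0]

print(cosine_similarity(query, doc1))  # ≈ 0.196
print(cosine_similarity(query, doc2))  # ≈ 0.447, so Doc2 ranks higher
```

Note that only the second component of the query is non-zero, so each document's score is driven by its weight on that one term, scaled by the document's length (its vector norm).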
Text representation: Term weights
Term weights reflect the estimated importance of each term.
• The more often a word occurs in a document, the better that term is at describing what the document is about.
• But terms that appear in many documents in the collection are not very useful for distinguishing a relevant document from a non-relevant one.
Term weights: TF.IDF
• Term weight X_i = TF * IDF
• TF: Term Frequency
• IDF: Inverse Document Frequency

TF = termCount / documentLength
IDF = log(totalNumDocuments / documentCount)
An example
Document 1: cat eat mouse, mouse eat chocolate
Document 2: cat eat mouse
Document 3: mouse eat chocolate mouse

Indexing terms:
Vector representation for each document
Query: cat
Vector representation of query
Which document is more relevant?
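The example above can be worked through in code. A minimal sketch, assuming the TF and IDF definitions given earlier (natural log; the function name `tf_idf_vectors` is my own):

```python
import math

def tf_idf_vectors(docs, terms):
    """One TF.IDF vector per document.
    TF = termCount / documentLength
    IDF = log(totalNumDocuments / documentCount)"""
    n = len(docs)
    # documentCount: how many documents contain each term
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    vectors = []
    for d in docs:
        vec = []
        for t in terms:
            tf = d.count(t) / len(d)
            idf = math.log(n / df[t]) if df[t] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

docs = [
    "cat eat mouse mouse eat chocolate".split(),
    "cat eat mouse".split(),
    "mouse eat chocolate mouse".split(),
]
terms = ["cat", "eat", "mouse", "chocolate"]
vectors = tf_idf_vectors(docs, terms)

# For the one-term query "cat", each document's score is just its
# weight on the "cat" dimension:
scores = [v[terms.index("cat")] for v in vectors]
```

Document 2 wins: both it and Document 1 contain "cat" once, but "cat" makes up a larger fraction of the shorter document. Note that "eat" and "mouse" occur in every document, so their IDF is log(3/3) = 0 and they contribute nothing to any vector.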
Evaluation
• Test collection
  • TREC
• Parameters
  • Recall: percentage of all relevant documents that are found by a search
  • Precision: percentage of retrieved documents that are relevant
Recall = (# of relevant items retrieved) / (# of relevant items in collection)
Precision = (# of relevant items retrieved) / (# of items retrieved)
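These two definitions translate directly into code. A minimal sketch with made-up document IDs (the function name and the example judgments are my own, not from a real test collection):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single query.
    retrieved: set of doc ids returned by the search
    relevant:  set of doc ids judged relevant in the collection"""
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returns 4 documents, 3 of which are
# relevant, out of 6 relevant documents in the whole collection.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
# p = 3/4 = 0.75, r = 3/6 = 0.5
```

Note the trade-off the discussion below asks about: returning every document in the collection drives recall to 1.0 while precision collapses, and returning a single sure-fire hit does the reverse.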
Discussion
• How do we compare two IR systems?
• Which is more important in Web search: precision or recall?
• What leads to Google's success for IR?