
Human Computer Interaction

Lecture 09: Information Retrieval

The Vector Space Model for Scoring: Variant tf-idf Functions

Maximum tf normalization suffers from the following issues (a sketch of the normalized weight appears after this list):

The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.

A document may contain an outlier term with an unusually large number of occurrences of that term, not representative of the content of that document.

More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
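As a concrete reference, here is a minimal sketch of maximum tf normalization as it is commonly defined, with a smoothing term a (often around 0.4) damping the contribution of raw term frequency. The function name and the sample document are illustrative, not taken from the slides.

    from collections import Counter

    def max_tf_normalized(term, doc_tokens, a=0.4):
        """Normalized tf: ntf = a + (1 - a) * tf / tf_max, where tf_max is the
        count of the most frequent term in the document."""
        counts = Counter(doc_tokens)
        tf = counts[term]
        tf_max = max(counts.values())
        return a + (1 - a) * tf / tf_max

    doc = "the jaguar ran fast the jaguar rested".split()
    print(max_tf_normalized("jaguar", doc))   # 0.4 + 0.6 * 2/2 = 1.0
    print(max_tf_normalized("fast", doc))     # 0.4 + 0.6 * 1/2 = 0.7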

The Vector Space Model for Scoring: Document and Query Weighting Schemes
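The weighting-scheme tables themselves are not reproduced above. As an illustration of the general idea, here is a minimal sketch of cosine scoring with one common combination from the textbook literature (logarithmic tf on both sides, idf on the query side only, cosine normalization, roughly lnc.ltc); whether this matches the exact scheme on the slides is an assumption, and the toy corpus and function names are invented for illustration.

    import math
    from collections import Counter

    docs = {
        "d1": "the jaguar is a large cat".split(),
        "d2": "the jaguar xj is a car".split(),
    }
    N = len(docs)

    # Document frequency of each term, used for idf.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    def weight_vector(tokens, use_idf):
        """Log tf, optional idf, then cosine (length) normalization."""
        tf = Counter(tokens)
        w = {}
        for t, c in tf.items():
            if use_idf and df[t] == 0:
                continue  # ignore query terms that never occur in the collection
            idf = math.log10(N / df[t]) if use_idf else 1.0
            w[t] = (1 + math.log10(c)) * idf
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        return {t: v / norm for t, v in w.items()}

    def cosine_score(query_tokens, doc_tokens):
        q = weight_vector(query_tokens, use_idf=True)   # "ltc"-style query weights
        d = weight_vector(doc_tokens, use_idf=False)    # "lnc"-style document weights
        return sum(w * d.get(t, 0.0) for t, w in q.items())

    for doc_id, tokens in docs.items():
        print(doc_id, round(cosine_score("jaguar car".split(), tokens), 3))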

Evaluation in information retrieval

How do we know which of the previously discussed IR techniques are effective in which applications?

Should we use stop lists?

Should we use stemming?

Should we use inverse document frequency weighting?

Evaluation in information retrieval

To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection consisting of three things (a minimal sketch follows this list):

Document collection

A test suite of information needs, expressible as queries

A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.
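A minimal sketch of how such a test collection might be represented; the toy documents, query, and judgments are invented for illustration.

    # A toy test collection: documents, an information need expressed as a query,
    # and binary relevance judgments (qrels) for query-document pairs.
    documents = {
        "d1": "The jaguar is the largest cat in the Americas.",
        "d2": "The Jaguar XJ is a luxury car.",
        "d3": "The impala is a medium-sized antelope.",
    }

    queries = {
        "q1": "speed of jaguar animal",
    }

    # 1 = relevant to the underlying information need, 0 = nonrelevant.
    qrels = {
        ("q1", "d1"): 1,
        ("q1", "d2"): 0,
        ("q1", "d3"): 0,
    }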

Evaluation in information retrieval

A document in the test collection is given a binary classification as either relevant or nonrelevant.

This decision is referred to as the gold standard or ground truth judgment of relevance.

Relevance is assessed relative to an information need, not a query.

This means that a document is relevant if it addresses the stated information need, not merely because it happens to contain all the words in the query.

Evaluation in information retrieval

Information Need:

Query = Jaguar
Query = Jaguar Speed
Query = Speed of Jaguar
Query = Speed of Jaguar animal
Query = Speed of impala
Query = Speed of impala animal
Query = impala

Evaluation in information retrieval: Standard Test Collections


Cranfield Collection: abstracts of 1398 articles, a set of 225 queries, and their respective relevance judgments.

TREC (Text Retrieval Conference): 6 CDs containing 1.89 million documents, and relevance judgments for 450 information needs, which are called topics and are specified in detailed text passages.

CLEF (Cross Language Evaluation Forum): This evaluation series has concentrated on European languages and cross-language information retrieval.

Evaluation in information retrieval: Precision/Recall Revisited
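For reference, a minimal sketch of the standard set-based precision and recall computation over binary relevance judgments; the function name and the example document sets are illustrative.

    def precision_recall(retrieved, relevant):
        """Precision = |retrieved ∩ relevant| / |retrieved|
        Recall    = |retrieved ∩ relevant| / |relevant|"""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant ones were found.
    print(precision_recall(["d1", "d2", "d4", "d7"], ["d1", "d4", "d9"]))  # (0.5, 0.666...)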

Offline Evaluation, a.k.a. Manual Judgments

Ask experts or users to explicitly evaluate your retrieval system.

Online Evaluation, a.k.a. Observing Users

Observe how ordinary users interact with your retrieval system during normal use.

Evaluation in information retrieval: General Types of Evaluation

Offline Evaluation involves:

Select queries to evaluate on

Get results for those queries

Assess the relevance of those results to the queries

Compute your offline metric (a sketch of these steps follows below)
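Putting these steps together, a minimal sketch of an offline evaluation loop that reports mean precision at rank 10. The ranking function, the qrels structure (from the toy collection sketched earlier), and the cutoff of 10 are illustrative assumptions, not details from the slides.

    def mean_precision_at_k(queries, qrels, search_fn, k=10):
        """Run each evaluation query, judge the top-k results against the
        relevance judgments, and average precision@k over all queries."""
        scores = []
        for qid, query_text in queries.items():
            results = search_fn(query_text)[:k]          # get results for the query
            hits = sum(qrels.get((qid, doc_id), 0)       # assess relevance via judgments
                       for doc_id in results)
            scores.append(hits / k)                      # precision@k for this query
        return sum(scores) / len(scores) if scores else 0.0

    # Usage with the toy collection above and any ranking function that returns
    # a list of document ids (hypothetical example, commented out):
    # print(mean_precision_at_k(queries, qrels, my_search_engine.rank))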

Evaluation in information retrieval: General Types of Evaluation

Online Evaluation involves:

Capture users' search behavior:

Search queries
Results and clicks
Mouse movements

Infer the relevance of those results from the captured behavior

Compute your online metric (e.g., click-through rate; a sketch follows below)
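As one example of an online metric, a minimal sketch of click-through rate computed from a log of captured result impressions and clicks; the log format and example entries are invented assumptions.

    def click_through_rate(interaction_log):
        """CTR = clicked result impressions / total result impressions.
        Each log entry is assumed to be (query, doc_id, was_clicked)."""
        if not interaction_log:
            return 0.0
        clicks = sum(1 for _, _, was_clicked in interaction_log if was_clicked)
        return clicks / len(interaction_log)

    log = [
        ("jaguar speed", "d1", True),
        ("jaguar speed", "d2", False),
        ("impala", "d3", True),
    ]
    print(click_through_rate(log))  # 2 clicks out of 3 impressions -> 0.666...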
