Special Topics in Computer Science
The Art of Information Retrieval
Chapter 3: Retrieval Evaluation
Alexander Gelbukh
www.Gelbukh.com

Previous chapter
Modeling is needed for formal operations
Boolean model is the simplest
Vector model is the best combination of quality and simplicity
o TF-IDF term weighting
o This (or similar) weighting is used in all further models
Many interesting and not well-investigated variations
o possible future work

Previous chapter: Research issues
How do people judge relevance?
o ranking strategies
How to combine different sources of evidence?
What interfaces can help users understand and formulate their Information Need?
o user interfaces: an open issue
Meta-search engines: combine results from different Web search engines
o Their results almost do not intersect
o How to combine rankings?

Evaluation!
How do you measure whether your system is good or bad?
To go in the right direction, you need to know where you want to get to.
“We can do it this way” vs. “This way it performs better”
o I think it is better...
o We do it this way...
o Our method takes into account syntax and semantics...
o I like the results...
Criterion of truth. Crucial for any science.
Enables competition -> financial policy -> attracts people
o TREC international competitions

Methodology
Define formally your task and constraints
Define formally your evaluation criterion (argue if needed)
o One numerical value, not several!
Demonstrate that your method gives a better value than
o the baseline (the simple obvious way): retrieve all, retrieve none, retrieve at random, use Google
o the state of the art (the best reported method)
Demonstrate that your parameter settings are optimal
o Consider singular (extreme) settings
o Set your parameters to 0. To infinity.

Methodology
The only valid way of reasoning
“But we want the clusters to be non-trivial”
o Then add this as a penalty to your criteria or as constraints
Divide your “acceptability considerations”:
o Constraints: yes/no.
o Evaluation: better/worse.
Check that your evaluation criteria are well justified
o “My formula gives it this way”
o “My result is correct since this is what my algorithm gives”
o Reason in terms of the user task, not your algorithm / formulas
Are your good/bad judgments in accord with intuition?

Evaluation?
IR: “user satisfaction”
o Difficult to model formally
o Expensive to measure directly (experiments with subjects)
At least two contradicting parameters
o Completeness vs. quality
o No good way to combine into one single numerical value
o Some “user-defined” “weights of importance” of the two: not formal, depend on the situation
Art

Parameters to evaluate
Performance evaluation
o Speed
o Space (tradeoff with speed)
o Common for all systems. Not discussed here.
Retrieval performance (quality?) evaluation
o = goodness of a retrieval strategy
o A test reference collection: docs and queries.
o The “correct” set (or ordering) provided by “experts”
o A similarity measure to compare system output with the “correct” one.

Evaluation: Model User Satisfaction
User task
o Batch query processing? Interaction? Mixed?
Way of use
o Real-life situation: what factors matter?
o Interface type
This chapter: laboratory settings
o Repeatability
o Scalability

Precision & Recall
Tradeoff (as with time and space)
Assumes the retrieval results are sets
o as in the Boolean model; in the Vector model, use a threshold
Measures closeness between two sets
Recall: of the relevant docs, how many (%) were retrieved? The others are lost.
Precision: of the retrieved docs, how many (%) are relevant? The others are noise.
Nowadays, with huge collections, Precision is more important!
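
Precision & Recall
$$\text{Recall} = \frac{|R_a|}{|R|}$$
$$\text{Precision} = \frac{|R_a|}{|A|}$$
where R is the set of relevant docs, A is the set of retrieved docs (the answer), and Ra is their intersection: the relevant docs that were retrieved.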
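As a minimal sketch of the two ratios above, the following Python function computes precision and recall for plain sets of document ids. The function name, the document ids, and the convention of returning 0.0 for an empty denominator are assumptions made for illustration, not part of the slides.

```python
def precision_recall(relevant, answer):
    """Return (precision, recall) for two collections of document ids."""
    relevant, answer = set(relevant), set(answer)
    hits = relevant & answer  # Ra: relevant docs that were retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(hits) / len(answer) if answer else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 relevant docs, 5 retrieved docs, 2 in common.
R = {"d1", "d2", "d3", "d4"}
A = {"d2", "d4", "d7", "d8", "d9"}
p, r = precision_recall(R, A)
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving the whole collection drives recall to 1 while precision drops to the share of relevant docs in the collection, which is the tradeoff the previous slide refers to.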

Ranked Output...
“Truth”: unordered “relevant” set
Output: ordered guessing
Compare ordered set with an unordered one

...Ranked Output
Plot precision vs. recall curve
In the initial part of the list containing n% of all relevant docs, what is the precision?
o 11 standard recall levels: 0%, 10%, ..., 90%, 100%.
o 0%: interpolated
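The points of such a curve can be sketched as follows: walk down the ranked list and, each time a relevant doc appears, record the recall reached so far and the precision of the list prefix. The function name and the example ranking are hypothetical.

```python
def precision_recall_points(ranking, relevant):
    """Return a list of (recall, precision) pairs, one per relevant doc seen."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Recall: share of relevant docs seen; precision: share of
            # relevant docs among the first `rank` retrieved docs.
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranking with 3 relevant docs at ranks 1, 3 and 6.
ranking = ["d3", "d9", "d1", "d7", "d5", "d2"]
relevant = {"d1", "d2", "d3"}
points = precision_recall_points(ranking, relevant)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

With 3 relevant docs the observed recall values are 33%, 67% and 100%, so none of the standard levels except 100% is hit directly.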

Many experiments
Average precision and recall
Ranked output: average precision at each recall level
To get equal (standard) recall levels, interpolation
o of 3 relevant docs, there is no 10% level!
o Interpolated value at level n = maximum known value between n and n + 1
o If none known, use the nearest known.
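A rough sketch of interpolation and averaging over queries is given below. Note the assumption: instead of the exact rule worded on the slide, it uses the widely used variant in which the interpolated precision at a recall level is the maximum precision observed at any recall at or above that level (0.0 if that recall is never reached). Function names and data are hypothetical.

```python
def interpolate_11pt(points):
    """Interpolate (recall, precision) points to recall 0.0, 0.1, ..., 1.0.

    Assumed rule: best precision observed at any recall >= the level.
    """
    result = []
    for i in range(11):
        level = i / 10
        # Small tolerance guards against floating-point noise in recall.
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates, default=0.0))
    return result


def average_curves(curves):
    """Average several 11-point curves level by level."""
    return [sum(column) / len(column) for column in zip(*curves)]


# Hypothetical observed points for two queries.
q1 = [(1 / 3, 1.0), (2 / 3, 2 / 3), (1.0, 0.5)]
q2 = [(0.5, 0.5), (1.0, 0.4)]
avg = average_curves([interpolate_11pt(q1), interpolate_11pt(q2)])
print([round(v, 3) for v in avg])
```

This variant always yields a non-increasing curve and also defines the value at the 0% level, which cannot be observed directly.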

Precision vs. Recall Figures
Alternative method: document cutoff values
o Precision at first 5, 10, 15, 20, 30, 50, 100 docs
Used to compare algorithms.
o Simple
o Intuitive
NOT a one-value comparison!
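Precision at a document cutoff can be sketched in a few lines. The function name and data are hypothetical, and dividing by the cutoff k even when fewer than k docs were returned is an assumed convention.

```python
def precision_at_k(ranking, relevant, k):
    """Share of relevant docs among the first k retrieved docs."""
    relevant = set(relevant)
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    # Assumed convention: divide by k, so a short answer list is penalized.
    return hits / k


# Hypothetical ranking and relevance judgments.
ranking = ["d3", "d9", "d1", "d7", "d5", "d2"]
relevant = {"d1", "d2", "d3"}
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))
```

Each cutoff gives its own number, so two systems can still disagree on which is better depending on the cutoff chosen.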
Which one is better?

Single-value summaries
Performance for an individual query
o Can be averaged over several queries, too
o Histogram for several queries can be made
o Tables can be made
o Curves cannot be used for this!
Precision at first relevant doc?
Average precision at (each) seen relevant doc
o Favors systems that give several relevant docs first
R-precision
o precision at the Rth retrieved doc (R = total number of relevant docs)
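The two summaries named above can be sketched as follows. Following the slide's wording, the average is taken over the relevant docs actually seen in the ranking; other definitions divide by the total number of relevant docs instead. Function names and data are hypothetical.

```python
def average_precision_seen(ranking, relevant):
    """Mean of the precision values measured at each relevant doc seen."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the total number of relevant docs."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


# Hypothetical ranking with relevant docs at ranks 1, 3 and 6.
ranking = ["d3", "d9", "d1", "d7", "d5", "d2"]
relevant = {"d1", "d2", "d3"}
ap = average_precision_seen(ranking, relevant)
rp = r_precision(ranking, relevant)
print(round(ap, 3), round(rp, 3))
```

Both are single numbers per query, so they can be averaged, tabulated, or plotted as a histogram over queries.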

Precision histogram
Two algorithms: A, B
For each query, plot the difference of their R-precision values, R(A) - R(B).
Which is better?
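A small sketch of the data behind such a histogram: per-query R-precision differences between two algorithms, plus a count of how often each one wins. The R-precision values below are invented for illustration only, and the function name is hypothetical.

```python
def r_precision_differences(r_prec_a, r_prec_b):
    """Per-query difference R(A) - R(B) for queries judged for both systems."""
    return {q: r_prec_a[q] - r_prec_b[q] for q in r_prec_a if q in r_prec_b}


# Hypothetical R-precision values per query (not real results).
r_a = {"q1": 0.6, "q2": 0.2, "q3": 0.5}
r_b = {"q1": 0.4, "q2": 0.5, "q3": 0.5}
diffs = r_precision_differences(r_a, r_b)

wins_a = sum(1 for d in diffs.values() if d > 0)  # bars above zero
wins_b = sum(1 for d in diffs.values() if d < 0)  # bars below zero
for q in sorted(diffs):
    print(q, f"{diffs[q]:+.2f}")
```

Positive bars mark queries where A does better, negative bars where B does; the histogram shows the split rather than a single verdict.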

Alternative measures
Problems with the Precision & Recall measure:
o Recall cannot be estimated with large collections
o Two values, but we need one value to compare
o Designed for batch mode, not interactive. Informativeness!
o Designed for linear ordering of docs (not weak ordering)
Alternative measures: combine both in one
o F-measure: harmonic mean of recall and precision
o E-measure: a parameter expresses the user preference, Rec vs. Prec
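The formulas for these two measures are not preserved in this copy of the slides. As a hedged sketch, the code below assumes the standard textbook definitions: F = 2 / (1/r + 1/P), the harmonic mean of recall r and precision P, and E = 1 - (1 + b^2) / (b^2/r + 1/P), where b is the user-chosen preference parameter. Function names and the handling of zero values are assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed standard definition)."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention for the degenerate case
    return 2 / (1 / recall + 1 / precision)


def e_measure(precision, recall, b=1.0):
    """E-measure with preference parameter b (assumed standard definition)."""
    if precision == 0 or recall == 0:
        return 1.0  # worst value, by the same convention
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)


# Hypothetical values: precision 0.4, recall 0.5.
f = f_measure(0.4, 0.5)
e1 = e_measure(0.4, 0.5, b=1.0)
e2 = e_measure(0.4, 0.5, b=2.0)
print(round(f, 4), round(e1, 4), round(e2, 4))
```

Under these definitions E with b = 1 equals 1 - F, and as b grows E approaches 1 - recall, so b shifts the weight between the two components.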

User-oriented measures
Definitions: [figure not preserved]
Coverage ratio
o high when many of the expected docs are found
Novelty ratio
o high when many new docs are found
Relative recall: # found / # expected
Recall effort: # expected / # examined until those are found
Other:
o expected search length (good for weak order)
o satisfaction (considers only relevant docs)
o frustration (considers only non-relevant docs)
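The definitions behind the first two ratios are not preserved in this copy, so the sketch below assumes the usual textbook ones: coverage is the share of the relevant docs already known to the user that appear in the answer, and novelty is the share of retrieved relevant docs that the user did not know before. All names and data are hypothetical.

```python
def coverage_and_novelty(answer, known_relevant, judged_relevant):
    """Return (coverage, novelty) under the assumed textbook definitions.

    answer:          docs returned by the system
    known_relevant:  relevant docs the user already knew (the expected ones)
    judged_relevant: docs in the answer that the user judges relevant
    """
    answer = set(answer)
    known = set(known_relevant)
    found_known = answer & known                        # expected and found
    found_new = (answer & set(judged_relevant)) - known  # relevant, but new
    coverage = len(found_known) / len(known) if known else 0.0
    total_found = len(found_known) + len(found_new)
    novelty = len(found_new) / total_found if total_found else 0.0
    return coverage, novelty


# Hypothetical example: the user expected 4 docs; the system returned 5.
answer = {"d1", "d2", "d5", "d6", "d9"}
known = {"d1", "d2", "d3", "d4"}
judged = {"d1", "d2", "d5"}
coverage, novelty = coverage_and_novelty(answer, known, judged)
print(round(coverage, 2), round(novelty, 2))
```

Both ratios depend on one particular user's prior knowledge, which is what makes them user-oriented rather than collection-oriented.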

Reference collections
Texts with queries and relevant docs known
TREC
o Text REtrieval Conference. Different in different years
o Wide variety of topics. Document structure marked up.
o 6 GB. See NIST website: available at small cost
Not all relevant docs marked!
o Pooling method:
o top 100 docs in the rankings of many search engines
o manually verified
o It was tested that this is a good approximation to the “real” set
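The pooling step can be sketched as a union of the top docs from each system's ranking; only the docs in the pool are then judged by hand. The function name and the toy rankings are hypothetical, and the depth is lowered to 2 in the example to keep it readable.

```python
def build_pool(rankings, depth=100):
    """Union of the top `depth` docs from each system's ranking."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:depth])
    return pool


# Hypothetical rankings from three systems for the same query.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d4", "d1"],
    ["d5", "d1", "d6"],
]
pool = build_pool(rankings, depth=2)
print(sorted(pool))
```

Docs outside the pool are never judged, which is why not all relevant docs end up marked.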

...TREC tasks
Ad-hoc (conventional: query -> answer)
Routing (ranked filtering of changing collection)
Chinese ad-hoc
Filtering (changing collection; no ranking)
Interactive (no ranking)
NLP: does it help?
Cross-language (ad-hoc)
High precision (only 10 docs in answer)
Spoken document retrieval (written transcripts)
Very large corpus (ad-hoc, 20 GB = 7.5 M docs)
Query task (several query versions; does the strategy depend on it?)
Query transforming
o Automatic
o Manual

...TREC evaluation
Summary table statistics
o # of requests used in the task
o # of retrieved docs; # of relevant retrieved and not retrieved
Recall-precision averages
o 11 standard points. Interpolated (and not)
Document level averages
o Also, can include average R-value
Average precision histogram
o By topic.
o E.g., difference between R-precision of this system and average of all systems

Smaller collections
Simpler to use
Can include info that TREC does not
Can be of specialized type (e.g., include co-citations)
Less sparse, greater overlap between queries
Examples:
o CACM
o ISI
o there are others

CACM collection
Communications of ACM, 1958-1979
3204 articles
Computer science
Structure info (author, date, citations, ...)
Stems (only title and abstract)
Good for algorithms relying on cross-citations
o If a paper cites another one, they are related
o If two papers cite the same ones, they are related
52 queries with Boolean form and answer sets
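The two cross-citation heuristics on this slide can be sketched over a simple citation map. The data structure (a dict from paper id to the set of ids it cites), the function names, and the paper ids are assumptions for illustration; they do not reflect the actual CACM file format.

```python
def directly_linked(citations, a, b):
    """True if one of the two papers cites the other."""
    return b in citations.get(a, set()) or a in citations.get(b, set())


def shared_references(citations, a, b):
    """Number of papers cited by both a and b."""
    return len(citations.get(a, set()) & citations.get(b, set()))


# Hypothetical citation map: paper id -> set of cited paper ids.
citations = {
    "p1": {"p3", "p4"},
    "p2": {"p3", "p4", "p5"},
    "p3": set(),
}
print(directly_linked(citations, "p1", "p3"))
print(shared_references(citations, "p1", "p2"))
```

A retrieval algorithm could use either signal, alongside term similarity, as evidence that two papers are related.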

ISI collection
On information sciences
1460 docs
For similarity in terms and cross-citation
Includes:
o Stems (title and abstracts)
o Number of cross-citations
35 natural-language queries with Boolean form and answer sets

Cystic Fibrosis (CF) collection
Medical
1239 docs
MEDLINE data
o keywords assigned manually!
100 requests
4 judgments for each doc
o Good to see agreement
Degrees of relevance, from 0 to 2
Good answer set overlap
o can be used for learning from previous queries

Research issues
Different types of interfaces; interactive systems:
o What measures to use?
o Such as informativeness

Conclusions
Main measures: Precision & Recall.
o For sets
o Rankings are evaluated through initial subsets
There are measures that combine them into one
o Involve user-defined preferences
Many (other) characteristics
o An algorithm can be good at some and bad at others
o Averages are used, but are not always meaningful
Reference collections exist with known answers to evaluate new algorithms

Thank you!
Till October 9
October 23: midterm exam