Resolving Power of Search Keys in MedEval, a Swedish Medical Test Collection with User Groups: Doctors and Patients PhD thesis by Karin Friberg Heppin, Göteborgs Universitet Opponent: Prof Dr Stefan Schulz Freiburg University (Germany)

Resolving Power of Search Keys in MedEval, a Swedish Medical Test Collection with User Groups: Doctors and Patients

PhD thesis by Karin Friberg Heppin, Göteborgs Universitet

Opponent: Prof Dr Stefan Schulz, Freiburg University (Germany)

Background

• More than fifty years of research in Information Retrieval (IR)

• Importance of IR as a key technology for dealing with large amounts of information in the era of the Internet

• Most IR research is done on English content and standard texts (mostly newswire)

• Specific issues in

  • Other languages: Swedish

  • Sublanguages / language registers: Medicine

Example

• Single-word compounds are common in Swedish texts (10%)

• Compare two search expressions:

(1) “narkotikapolitik”

(2) “fotboll”

• What happens if they are used in a trivial IR setting?

  • Does a search for (1) retrieve enough relevant documents?

• What if single-word compounds are decomposed and their components used as search keys?

  • Does a search for “fot” AND “boll” still yield relevant documents?

  • Or does it only add noise?

Focus of thesis

• Resource production

  • a medical test document collection in Swedish

• IR research questions

  • What are the characteristics of good search keys in general?

  • Can professional language characteristics be used for optimizing target-group specific searches?

  • Are compounds good search keys or is it better to use their constituents as search keys?

Organization of thesis

IR background: exhaustive review of the state of the art: models, evaluation, linguistics, medical IR

Test environment: tools, resources, creation of MedEval test collection

Pilot studies: investigation of the behavior of terms and groups of terms; analysis of the patient and doctor documents

Literature / Appendix

259 pp.

Main hypothesis

The resolving power of search keys is dependent on their frequency in the document collection

This should guide the decision whether to use decomposition of single-word compounds

How is the hypothesis being validated?

• Creation of a medical test collection

• Running IR experiments

• pilot study

• manual inspection and error analysis

Creation of a Test Collection

• Subset of MedLex with medical texts from different sources, totalling 42000 documents

• Two indexes

  • original tokens (e.g. “saltkoncentration”)

  • tokens and constituents of compounds (e.g. “saltkoncentration”, “salt”, “koncentration”)

• Document processing: tokenization, lemmatization, non-lexical decomposition
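
As a rough illustration of the two indexes, the following Python sketch builds a token-only index and a token-plus-constituent index over a handful of made-up documents. The `COMPOUND_PARTS` table, the helper names and the sample texts are assumptions for this example only; the thesis used an automatic non-lexical splitter and the Indri engine, neither of which is reproduced here.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real compound splitter.
COMPOUND_PARTS = {
    "saltkoncentration": ["salt", "koncentration"],
    "fotboll": ["fot", "boll"],
}


def build_index(docs, decompose=False):
    """Map each token (and optionally each compound constituent) to doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if decompose:
                for part in COMPOUND_PARTS.get(token, []):
                    index[part].add(doc_id)
    return index


def search_and(index, *keys):
    """Boolean AND: ids of the documents that contain every search key."""
    postings = [index.get(key, set()) for key in keys]
    return set.intersection(*postings) if postings else set()


# Invented sample documents (not taken from MedEval).
docs = {
    1: "hög saltkoncentration i blodet",
    2: "salt i maten",
    3: "koncentration och minne",
    4: "fotboll och skador",
    5: "en boll föll på patientens fot",
}

token_index = build_index(docs)                  # original tokens only
split_index = build_index(docs, decompose=True)  # tokens + constituents

print(search_and(token_index, "salt", "koncentration"))
print(search_and(split_index, "salt", "koncentration"))
print(search_and(split_index, "fot", "boll"))
```

In this toy setting the constituent query for salt and koncentration only reaches document 1 through the decomposed index, while the query for fot and boll also returns document 5, which is not about football. That is the kind of noise the decomposition question is about.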

Tools used

• Indri search engine:

  • inference network approach

  • produces ranked output

  • complex query syntax (several proximity parameters, individual weighting, Boolean AND)

• trec_eval: evaluation toolkit

• Query Performance Analyzer

Example Query Performance Analyzer

Topic collection and relevance assessment

• 62 topics were acquired by medical students

• relevance assessment done by pooling (suboptimal, but the only feasible strategy with the given resources):

  • interactive searching and judging (four runs × pool depth 100)

• four grades of relevance judgements

• judgements of target readers: patients vs. physicians vs. both

• adjusted relevance scores
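
Pooling can be sketched as follows, assuming the pool for a topic is simply the union of the top-ranked documents of each run. The slides do not say how the four runs were merged, so this is an assumption, and the run contents below are invented.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` document ids of every ranked run."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool


# Four invented ranked result lists for a single topic.
runs = [
    ["d3", "d1", "d7", "d2"],
    ["d1", "d9", "d3", "d4"],
    ["d5", "d1", "d2", "d8"],
    ["d1", "d3", "d5", "d6"],
]

shallow_pool = build_pool(runs, depth=2)  # only the top 2 of each run
full_pool = build_pool(runs)              # default depth 100 covers every list
print(sorted(shallow_pool), len(full_pool))
```

With four runs and a pool depth of 100, at most 400 documents per topic would need judging, and fewer whenever the runs overlap. Documents outside the pool remain unjudged, which is one reason pooling is described as suboptimal.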

Six different scenarios

Creating baseline queries

• Division of the terms of a query into facets (conceptual aspects) in order to assess the impact of query components

• Using words of the topics + Swedish MeSH synonyms

• Example of a facet:

TREATMENT = #syn(behandla behandling strategi behandlingsstrategi behandlingsmetod behandlingsalternativ tillvägagångssätt genomföra)

• Parameter for assessment: normalized discounted cumulative gain (nDCG)
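
A faceted baseline query of this kind could be assembled along the following lines. This is a sketch under assumptions: `#syn` appears on the slide and `#combine` is a standard Indri operator, but the slides do not show how the facets were joined in the thesis, and the second facet and all helper names are hypothetical.

```python
def syn(terms):
    """Wrap the terms of one facet in Indri's #syn operator."""
    return "#syn(" + " ".join(terms) + ")"


def build_query(facets):
    """Join all facets; using #combine at the top level is an assumption."""
    return "#combine(" + " ".join(syn(terms) for terms in facets.values()) + ")"


def without_facet(facets, name):
    """Copy of the facets with one facet left out, to probe its contribution."""
    return {key: terms for key, terms in facets.items() if key != name}


facets = {
    "TREATMENT": [
        "behandla", "behandling", "strategi", "behandlingsstrategi",
        "behandlingsmetod", "behandlingsalternativ", "tillvägagångssätt",
        "genomföra",
    ],
    "CANCER": ["cancer", "bröstcancer"],  # hypothetical second facet
}

print(build_query(facets))
print(build_query(without_facet(facets, "CANCER")))
```

Comparing the retrieval score of the full query with that of each reduced query is one way to estimate how much a single facet contributes.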

Analysis of the contribution of

• words

• word fragments

• facets

to the query performance (resolving power), measured in nDCG (normalized discounted cumulative gain)
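
For reference, nDCG can be computed as in the sketch below. It assumes the common formulation in which the gain at rank i is divided by log2(i + 1) and the ideal ranking is cut to the length of the evaluated ranking; the exact discount and cutoff used in the thesis are not given on the slides. Gains are taken to be the graded relevance scores, here 0 to 3 for four grades.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, judged_gains):
    """DCG of a ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance grades of the retrieved documents, in rank order.
    judged_gains: relevance grades of all judged documents for the topic.
    """
    ideal = sorted(judged_gains, reverse=True)[:len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented grades for one topic.
judged = [3, 2, 1, 0]
print(ndcg([3, 2, 1, 0], judged))  # ideal order
print(ndcg([0, 1, 2, 3], judged))  # reversed order scores lower
```

Because highly relevant documents contribute more when they are ranked early, a search key that pushes them down the list lowers nDCG even if the same documents are still retrieved.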

Test of the suitability of single terms (figure; axis: nDCG)

Measuring resolving power by removing facets (figure; annotation: retrieves noise; axis: nDCG)

Recall vs. noise

Quality of search keys (dependent on topic)

ineffective keys

effective keys

Conclusions

• Ineffective search keys are more likely to be found among terms with very high and very low frequency (statistically significant), but the effect is not very strong and there are important exceptions

• Low frequency compounds can benefit from decomposition

• Only split compounds if the constituents have greater resolving power than the compound

• If the compound has a head-modifier structure, only use the head, as the modifier is assumed to have low resolving power

• No clear message on how professional language characteristics can be used for optimizing target-group specific searches

Questions to the candidate

Question 1

It is interesting that the early IR work you cited referred not to the parameter pair precision/recall, but to specificity/sensitivity, where

sensitivity = recall = relevant docs found / relevant docs

specificity = non-relevant docs not found / non-relevant docs

precision = relevant docs found / found docs

Is there a reason why mainstream IR research abandoned specificity? Is precision really an unproblematic parameter? Do you know of recent work that reintroduced specificity in IR?
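
The three measures can be compared on a small worked example. The sketch below takes specificity in its standard sense (non-relevant documents not found, divided by all non-relevant documents). The collection size of 42000 is taken from the test collection slide; the other figures and the function name are invented for illustration.

```python
def retrieval_measures(collection_size, relevant, found, relevant_found):
    """Recall, precision and specificity for one query over a collection."""
    nonrelevant = collection_size - relevant
    nonrelevant_found = found - relevant_found
    return {
        "recall": relevant_found / relevant,
        "precision": relevant_found / found,
        "specificity": (nonrelevant - nonrelevant_found) / nonrelevant,
    }


# Two invented result lists of 100 documents for a topic with 50 relevant docs.
good = retrieval_measures(42000, relevant=50, found=100, relevant_found=40)
poor = retrieval_measures(42000, relevant=50, found=100, relevant_found=5)

for name, scores in (("good", good), ("poor", poor)):
    print(name, {measure: round(value, 4) for measure, value in scores.items()})
```

In this example recall and precision separate the two result lists clearly, whereas specificity stays above 0.99 in both cases, because almost every document in a large collection is a non-relevant document that was not retrieved.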

Question 2

The F-measure allows different weights to be assigned to precision and recall. This is important when different user scenarios are to be studied. There are scenarios in which recall is more important and the user accepts noise because no relevant document may be missed. Would it be possible to express these user scenarios using nDCG?
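
The weighting referred to here is usually written as F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), where beta > 1 favours recall and beta < 1 favours precision. A minimal sketch with invented precision and recall values:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# A noisy but fairly complete result list: low precision, high recall.
precision, recall = 0.2, 0.9
print(f_measure(precision, recall, beta=1.0))  # balanced weighting
print(f_measure(precision, recall, beta=2.0))  # recall-oriented scenario
print(f_measure(precision, recall, beta=0.5))  # precision-oriented scenario
```

The same result list therefore scores higher in a recall-oriented scenario than in a precision-oriented one.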

Question 3

You used normalized discounted cumulative gain (nDCG) for measuring the resolving power of search expressions. Why did you choose this parameter (and not the widely used F-measure)?

Question 4

You describe a Swedish stemmer that produces a 15 percent increase in precision. Stemmers are supposed to increase recall. How can they increase precision?

Question 5

I am surprised that a lexicon-free compound splitter, guided by indicative consonant sequences, yielded relatively good results. Do these results also hold for neoclassical compounds, which are typical for medicine (e.g. “cerebrovaskulär”)?

Could the performance have been improved if you had combined it at least with a basic lexicon of medical terms?

Question 6

Could you explain in a bit more detail the difference between a synonym set and a facet? If you include hyponyms, what does this mean for compounds? Does this mean that a facet for cancer would look like this:

CANCER = (cancer bröstcancer lungcancer pankreascancer kolorektalcancer tjocktarmscancer magsäckscancer …)

Question 7

It seems that the conclusions regarding the two language register studies are less obvious compared to the analysis of the effectiveness of decomposition of compounds.

What had you originally expected from the distinction between registers and what could still be done to achieve the expected result?

Question 8

The conclusions of your “pilot” studies are based on a careful dissection of the elements of topic descriptions in order to pick out characteristic features of the terms and to assess their usefulness. Nevertheless, what you found out is still hypothetical.

How could an empirical study be devised that provides stronger evidence of the validity of your conclusions?

Question 9

On page 34 you address reliability: why does reliability matter and how can you measure it? Has reliability been assessed in the construction of your test collection?

Question 10

Your analysis of multiword units seems a bit off-topic. Could you explain the rationale for this investigation and why this matters for the discussion of language registers?

Question 11

Question 12

Your conclusion that single-word compounds should be treated in a differentiated way (according to semantic and statistical criteria) seems not so original, because this is already obvious when observing the behaviour of Web search engines.

Which related work do you know that addressed these issues, what are its conclusions, and what is the specific contribution of your own research?

Question 13

The test collection you have built is certainly a valuable resource for further research. What kind of research can you imagine, or would you like to see, using your test collection?