Disambiguation of Biomedical Text Mark Stevenson Natural Language Processing Group University of Sheffield, UK http://www.dcs.shef.ac.uk/~marks Joint work with: Yikun Guo and Robert Gaizauskas (University of Sheffield) and David Martinez (University of Melbourne)


Disambiguation of Biomedical Text

Mark Stevenson

Natural Language Processing Group

University of Sheffield, UK

http://www.dcs.shef.ac.uk/~marks

Joint work with:

Yikun Guo and Robert Gaizauskas (University of Sheffield)

and David Martinez (University of Melbourne)


Outline

• Ambiguity in biomedical documents

• Disambiguation
  – Knowledge sources

• Evaluation

• Semi-supervised acquisition of additional training data


Text in Biomedical Domain

• The literature on biomedicine and the life sciences is vast and growing rapidly

• Promising domain for text processing

• Search engines necessary

• Opportunities for knowledge discovery


Ambiguity

• Lexical ambiguity makes text processing more difficult

• Generally believed that ambiguities do not occur within domains
  – One Sense per Discourse (Gale, Church and Yarowsky, 1992)
  – “there is a very strong tendency (98%) for multiple uses of a word to share the same sense in a well-written discourse”


cell


culture

“In peripheral blood mononuclear cell culture streptococcal erythrogenic toxins are able to stimulate tryptophan degradation in humans.”

International Allergy Immunology

“The aim of this paper is to describe the origins, initial steps and strategy, current progress and main accomplishments of introducing a quality management culture within the healthcare system in Poland.”

International Journal of Qualitative Health Care


Extent of Ambiguity Problem

• Weeber et al. (2001)
  • Estimated that 11.7% of the phrases in abstracts added to MEDLINE in 1998 were ambiguous

• Ambiguity is the biggest challenge in automation of indexing MEDLINE and a hindrance to automated knowledge discovery (Weeber et al. 2001; Nadkarni et al. 2001; Aronson 2001)


WSD System

• Supervised learning approach

• Extension of Basque Country University’s Senseval-3 system (Agirre and Martinez, 2004)

• Combines range of knowledge sources

• Previous work has shown that combining knowledge sources is an effective approach to WSD


Features

1. General
  • Wide range of features which are commonly used by WSD systems

2. Domain specific
  • Two knowledge sources specific to biomedical domain


Example

• “Body surface area adjustments of initial heparin dosing …”

1. Individual Adjustment
  “By the fast (2.5mph) ambulation trial, both groups were performing equally, suggesting a rapid rate of adjustment to the device.”

2. Adjustment Action
  “Clinically, these four patients had mild symptoms which improved with dietary adjustment.”

3. Psychological adjustment
  “Predictors of patients' mental adjustment to cancer: patient characteristics and social support.”


General Features (1)

• Local collocations
  • Bigrams and trigrams containing ambiguous word constructed from lemmas, word forms and PoS tags
    • left-content-word-lemma “area adjustment”
    • right-function-word-lemma “adjustment of”
    • left-POS “NN NNS”
    • right-POS “NNS IN”
    • left-content-word-form “area adjustments”
    • right-function-word-form “adjustments of”
  • First noun, verb, adjective and adverb preceding and following ambiguous word (lemma and word form)
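
A minimal sketch of how bigram collocation features of this kind could be built for the “adjustments” example. It is an illustration under simplifying assumptions, not the system's implementation: the helper name `collocation_features` and the feature names are hypothetical, tokens are tagged by hand, only immediately adjacent words are used (no content-word/function-word distinction), and trigrams are omitted.

```python
from typing import Dict, List, Tuple

# A token is (word form, lemma, PoS tag). The tags below are typed in by hand;
# a real system would take them from a tagger and lemmatiser.
Token = Tuple[str, str, str]


def collocation_features(tokens: List[Token], i: int) -> Dict[str, str]:
    """Bigram features for the ambiguous token at index i (hypothetical helper)."""
    form, lemma, pos = tokens[i]
    feats: Dict[str, str] = {}
    if i > 0:  # bigram ending at the ambiguous word
        l_form, l_lemma, l_pos = tokens[i - 1]
        feats["left-lemma"] = f"{l_lemma} {lemma}"
        feats["left-form"] = f"{l_form} {form}"
        feats["left-POS"] = f"{l_pos} {pos}"
    if i + 1 < len(tokens):  # bigram starting at the ambiguous word
        r_form, r_lemma, r_pos = tokens[i + 1]
        feats["right-lemma"] = f"{lemma} {r_lemma}"
        feats["right-form"] = f"{form} {r_form}"
        feats["right-POS"] = f"{pos} {r_pos}"
    return feats


sentence: List[Token] = [
    ("Body", "body", "NN"),
    ("surface", "surface", "NN"),
    ("area", "area", "NN"),
    ("adjustments", "adjustment", "NNS"),
    ("of", "of", "IN"),
]
feats = collocation_features(sentence, 3)
for name, value in sorted(feats.items()):
    print(name, "=", value)
```

Each feature is a name/value pair, so the same dictionary can be merged with features from the other knowledge sources before learning.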


General Features (2)

• Syntactic dependencies
  • Five relations: subject, object, noun-modifier, preposition and sibling

• Salient bigrams
  • Salient bigrams in abstract

• Unigrams
  • Lemmas of all content words in the abstract and 8 word window around target word
  • Lemmas of unigrams which appear frequently in entire corpus


Concept Unique Identifiers (CUIs)

• CUIs refer to UMLS concepts

• MetaMap segments text and identifies possible CUIs for each phrase

"Body surface area adjustments"

C0005902:Body Surface Area [Diagnostic Procedure]

C1261466:Body surface area [Organism Attribute]

C0456081:Adjustments (Adjustment Action) [Health Care Activity]

C0376209:Adjustments (Individual Adjustment) [Individual Behavior]

"of initial heparin dosing"

C0205265:Initial (Initially) [Temporal Concept]

C1555582:initial [Idea or Concept]

C0019134:Heparin [Biologically Active Substance,Carbohydrate]


Medical Subject Headings (MeSH)

• Controlled vocabulary for indexing life science publications

• Contains over 24,000 headings organised into an 11 level hierarchy

• Use MeSH terms assigned to abstract containing ambiguous term

M01.060.116.100: “Aged”
M01.060.116.100.080: “Aged, 80 and over”
D27.505.954.502.119: “Anticoagulants”
G09.188.261.560.150: “Blood Coagulation”


Learning Algorithms

1. Vector Space Model
  • Simple memory-based learning algorithm

2. Naïve Bayes

3. Support Vector Machine
  • Weka implementations
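
The slides do not spell out the Vector Space Model. One common reading, sketched here as an assumption rather than as the system's actual algorithm, stores one vector per sense (the sum of the binary feature vectors of its training examples) and assigns a new example to the sense whose vector has the highest cosine similarity. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def train(examples: Iterable[Tuple[List[str], str]]) -> Dict[str, Counter]:
    """Sum the binary feature vectors of the training examples of each sense."""
    sense_vectors: Dict[str, Counter] = defaultdict(Counter)
    for features, sense in examples:
        sense_vectors[sense].update(set(features))
    return dict(sense_vectors)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[f] for f, weight in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(sense_vectors: Dict[str, Counter], features: List[str]) -> str:
    """Return the sense whose stored vector is most similar to the example."""
    example = Counter(set(features))
    return max(sense_vectors, key=lambda s: cosine(sense_vectors[s], example))


# Invented toy data for two senses of "culture".
training = [
    (["cell", "protein", "gene"], "laboratory"),
    (["blood", "cell", "growth"], "laboratory"),
    (["ethnic", "practice", "health"], "anthropological"),
]
model = train(training)
predicted = classify(model, ["protein", "cell"])
print(predicted)
```

The feature strings can be anything hashable, so linguistic features, CUIs and MeSH terms could share one vector.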


NLM-WSD data set

• Standard evaluation corpus for WSD in biomedical domain (“Biomedical SemEval”)

• Contains 50 highly ambiguous terms frequently found in Medline

• 100 instances of each term manually disambiguated with UMLS concepts by a team of annotators

• Baseline (MFS) accuracy of 78%

• Average of 2.64 possible meanings per term
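
The most frequent sense (MFS) baseline labels every instance of a term with that term's most common sense. The sketch below shows the arithmetic on invented counts; the sense labels and figures are illustrative only and are not taken from NLM-WSD, and averaging per-term accuracies is an assumption about how the overall figure is computed.

```python
from collections import Counter
from typing import Dict, List


def mfs_accuracy(labels: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Invented gold labels for two terms with 100 instances each.
gold: Dict[str, List[str]] = {
    "term_a": ["s1"] * 75 + ["s2"] * 25,
    "term_b": ["s1"] * 60 + ["s2"] * 30 + ["s3"] * 10,
}
per_term = {term: mfs_accuracy(labels) for term, labels in gold.items()}
overall = sum(per_term.values()) / len(per_term)
print(per_term, overall)
```

With the same number of instances per term, the mean of per-term accuracies equals the accuracy over all instances pooled together.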


Results

       General   CUI    MeSH   CUI + MeSH   Ling + MeSH   Ling + CUI   All
VSM    87.0      85.8   81.9   86.9         87.9          87.3         87.5
NB     86.4      81.2   85.7   81.1         86.4          81.7         81.8
SVM    85.9      83.5   85.3   84.5         86.2          85.3         86.0

• Combination of linguistic features with MeSH terms significantly better than any features used alone

• VSM significantly better than other learning algorithms


[Figure: the 50 NLM-WSD terms grouped by the evaluation subsets used in previous work. Subset labels: Liu et al. (2004); Leroy and Rindflesch (2005); Joshi et al. (2005); Common. Annotations: Dominant sense < 90%; Removed low IAA; Dominant sense < 65%.]

Terms shown: cold, depression, discharge, extraction, fat, implantation, japanese, lead, mole, pathology, reduction, sex, ultrasound, degree, growth, man, mosaic, nutrition, repair, scale, weight, white, adjustment, blood pressure, evaluation, immunosuppression, radiation, sensitivity, association, condition, culture, determination, energy, failure, fit, fluid, frequency, ganglion, glucose, inhibition, pressure, resistance, secretion, single, strains, support, surgery, transient, transport, variation


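Approach

             MFS    Liu et al.   Leroy & Rindflesch   Joshi et al.   McInnes et al.   Reported
                    (2004)       (2005)               (2005)         (2007)           (General + MeSH)
All words    78.0   -            -                    -              85.3             87.9
Joshi        66.9   -            -                    82.5           80.0             83.3
Leroy        55.3   -            65.5                 77.4           74.5             79.7
Liu          69.9   78.0         -                    84.9           82.0             84.8
Common       54.9   68.8†        †                    79.8           75.7             81.1

† The Common row has a single value (68.8) for these two columns; it belongs to either Liu et al. (2004) or Leroy & Rindflesch (2005).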


Automatic Example Generation

• Various approaches to generating sense tagged examples without the need for manual annotation
  • Monosemous relatives (Leacock et al. 1998)
  • Translations as sense definitions (Ng et al. 2003)

• All unsupervised but require external knowledge sources (e.g. WordNet or parallel text)

• Alternative semi-supervised approach


Relevance Feedback

• Method for improving search results based on analysis of retrieved documents

[Figure: relevance feedback loop, with elements labelled Query, Retrieved documents, Relevance judgements and Modified Query]


• Common approach to relevance feedback for vector space model (Rocchio, 1971)

qm = α q + (β / |D+q|) Σ_{d ∈ D+q} d - (γ / |D-q|) Σ_{d ∈ D-q} d

qm = modified query vector
q = original query vector
D+q = set of vectors representing known relevant documents
D-q = set of vectors representing known irrelevant documents
α, β, γ = weights
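
A small sketch of the standard Rocchio update using the symbols defined here: the modified query is α times the original query, plus β times the mean of the relevant document vectors, minus γ times the mean of the irrelevant ones. The function names, the default weights (values often quoted in textbooks) and the three-dimensional toy vectors are assumptions for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Component-wise mean; a zero vector if the set is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(q: Vector, relevant: Sequence[Vector], irrelevant: Sequence[Vector],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> Vector:
    """qm = alpha*q + beta*mean(relevant) - gamma*mean(irrelevant)."""
    dim = len(q)
    pos = mean_vector(relevant, dim)
    neg = mean_vector(irrelevant, dim)
    return [alpha * q[i] + beta * pos[i] - gamma * neg[i] for i in range(dim)]


# Toy term space with three terms.
q = [1.0, 0.0, 0.0]
relevant_docs = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
irrelevant_docs = [[0.0, 0.0, 1.0]]
qm = rocchio(q, relevant_docs, irrelevant_docs)
print(qm)
```

Terms that occur mainly in relevant documents gain weight in the modified query, and terms that occur mainly in irrelevant documents are pushed below zero.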


Acquiring Sense Tagged Examples

• Treat set of sense tagged examples as retrieved documents
  • Examples tagged with sense considered relevant, all other examples considered irrelevant

• For each sense, identify additional query terms which tend to discriminate examples tagged with that sense from those tagged with other senses

• Search for documents matching this extended query


Identifying Query Terms

• Compute score for each term in the sense-tagged documents against each sense

count(t,d) = frequency of term t in document d
idf(t) = inverse document frequency of t
D+s = set of examples of target sense
D-s = set of examples of other senses
α, β = weights
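
The scoring equation itself is not reproduced in the text. One Rocchio-style form consistent with the definitions, given here purely as an assumption, is score(t,s) = idf(t) × ((α / |D+s|) Σ count(t,d) over D+s - (β / |D-s|) Σ count(t,d) over D-s). The sketch below implements that assumed form; the idf definition log(N / df), the weights α = β = 1 and the toy documents are also assumptions.

```python
import math
from typing import List, Sequence

Document = List[str]  # a document is a list of (lemmatised) terms


def idf(term: str, docs: Sequence[Document]) -> float:
    """Assumed idf: log(N / df); 0 if the term occurs in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def score(term: str, pos_docs: Sequence[Document], neg_docs: Sequence[Document],
          alpha: float = 1.0, beta: float = 1.0) -> float:
    """Assumed form: idf(t) * (alpha * mean count in D+s - beta * mean count in D-s)."""
    all_docs = list(pos_docs) + list(neg_docs)
    pos = sum(d.count(term) for d in pos_docs) / len(pos_docs) if pos_docs else 0.0
    neg = sum(d.count(term) for d in neg_docs) / len(neg_docs) if neg_docs else 0.0
    return idf(term, all_docs) * (alpha * pos - beta * neg)


# Invented examples: D+s is one sense of "culture", D-s the other.
anthropological = [["cultural", "practice", "ethnic"], ["cultural", "health"]]
laboratory = [["protein", "gene", "cell"], ["cell", "protein", "health"]]

vocabulary = sorted({t for d in anthropological + laboratory for t in d})
ranked = sorted(vocabulary, key=lambda t: score(t, anthropological, laboratory),
                reverse=True)
print(ranked[:3])
```

Terms seen only in the target sense's examples score highest, terms shared equally between senses score zero, and terms typical of the other senses score below zero.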


Terms for two senses of “culture”

‘anthropological culture’        ‘laboratory culture’
cultural          26.17          suggest        6.32
recommendation    14.82          protein        6.13
force             14.80          presence       5.86
ethnic            14.79          demonstrate    5.86
practice          14.76          analysis       5.78
man               14.76          gene           5.58


Example Collection

• Identify examples by querying Medline via online interface

• Preserve bias in original sense distribution
  • For example, if 75% of usages are ‘laboratory culture’ and 25% ‘anthropological culture’ then ensure same 75:25 split in retrieved examples

• Use eight highest scoring terms (score(t,s)) for each sense

• Relax queries until enough examples can be retrieved:
  culture AND (suggest AND protein AND presence)
  culture AND ((suggest AND protein) OR (suggest AND presence) OR (protein AND presence))
  culture AND (suggest OR protein OR presence)
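
The three example queries follow a pattern: require all terms, then any two, then any one. The sketch below generalises this to "at least k of n terms" for decreasing k. That generalisation and the function name `relaxed_queries` are assumptions; the slides only show the three-term case, and the query strings are not claimed to match any particular search interface's syntax.

```python
from itertools import combinations
from typing import List


def relaxed_queries(target: str, terms: List[str]) -> List[str]:
    """Boolean queries from strictest (all terms) to loosest (any one term)."""
    queries = []
    for k in range(len(terms), 0, -1):
        # Every way of requiring k of the terms together.
        groups = [" AND ".join(combo) for combo in combinations(terms, k)]
        if len(groups) > 1 and k > 1:
            groups = [f"({g})" for g in groups]
        queries.append(f"{target} AND ({' OR '.join(groups)})")
    return queries


queries = relaxed_queries("culture", ["suggest", "protein", "presence"])
for query in queries:
    print(query)
```

A caller would issue the queries in order and stop at the first one that returns enough examples for the sense.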


Experiments

• 10-fold cross validation
  • Training portion (90 examples) analysed to generate additional examples
  • Generated four sets for each term: 90, 180, 270 and 360 examples

• Combine automatically generated examples with training portion (+90, +180, +270, +360)

• Automatically generated examples alone (90, 180, 270, 360)


Performance

Basic: 87.9

Combined
+90    +180   +270   +360
89.6   88.6   88.0   88.0

Additional only
90     180    270    360
88.4   87.9   87.5   87.3


Individual Terms

term             basic   90   difference
blood pressure   53      66   13
reduction        88      96   8
repair           86      92   6
mole             88      94   6
ultrasound       88      94   6
white            81      72   -9
weight           82      71   -11
degree           93      81   -12
evaluation       81      69   -12


Conclusion

• Ambiguity real problem in biomedical domain

• Domain specific knowledge improves WSD performance

• Relevance feedback can be used to acquire additional training examples and further improve performance


More Information

• This work has been funded by EPSRC grants BioWSD and CASTLE

http://nlp.shef.ac.uk/BioWSD/
