Language Data Resources
Word senses
Sense-tagged corpora
• Lev V. Ščerba: And indeed, every sufficiently complex word must actually become the subject of a scientific monograph; therefore it is hard to expect in the near future the completion of a good dictionary.
Word sense disambiguation
• The different meanings of polysemous words are known as “senses”, and the process of deciding which sense is being used in a particular context is known as “word sense disambiguation”
Lexical Acquisition Bottleneck
• In NLP, many systems do not perform in practice as well as they could with adequate dictionary resources, because of the cost of producing, adapting, and maintaining these resources
• Solutions
– Reusing existing dictionaries and ontologies as lexicons
– Deriving disambiguation information directly from corpora
Usefulness of WSD
• NLP tools:
– Systems – carry out some task of “interest for its own sake” (e.g. MT, IR); applications potentially interesting for non-linguists
– Components – interesting for linguists and language engineers; e.g. WSD
Early approaches
• Preference semantics – 1970s
– Selectional constraints (e.g. ANIMATE for subject of “to drink”)
• Word experts – 1980s
– Hand-crafted disambiguators constructed for each word separately
– Limited applicability
• Polaroid words
– Gradual disambiguation (grammar, parser, lexicon, semantic interpreter, knowledge representation language)
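The selectional-constraint idea can be sketched as a filter over candidate senses. This is only a minimal illustration, not a reconstruction of any of the historical systems; the toy lexicon, the feature names other than ANIMATE, and the `filter_senses` helper are all assumptions made for the example.

```python
# Toy illustration of selectional constraints (all data here is invented).
# Each noun sense carries semantic features; each verb states which feature
# it prefers for its subject.

NOUN_SENSES = {
    "crane": [
        {"sense": "crane/bird", "features": {"ANIMATE"}},
        {"sense": "crane/machine", "features": {"ARTIFACT"}},
    ],
}

SUBJECT_PREFERENCE = {"drink": "ANIMATE"}


def filter_senses(noun, verb):
    """Return the senses of `noun` that satisfy the verb's subject preference.

    If no sense satisfies it, or the verb states no preference, all senses
    are returned: the constraint is treated as a preference, not a hard rule.
    """
    candidates = NOUN_SENSES.get(noun, [])
    wanted = SUBJECT_PREFERENCE.get(verb)
    if wanted is None:
        return [c["sense"] for c in candidates]
    matching = [c["sense"] for c in candidates if wanted in c["features"]]
    return matching or [c["sense"] for c in candidates]


print(filter_senses("crane", "drink"))
```

Falling back to all senses when nothing matches keeps the constraint a preference rather than a prohibition, so metaphorical uses are not rejected outright.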
Dictionary Based Approaches
• Since the 1980s, dictionary publishers have produced “Machine Readable Dictionaries” (now called machine-tractable dictionaries)
• Wider polysemy than in the systems described so far
Two claims about sense distribution
• One sense per discourse
– There is a very strong tendency for multiple uses of a word to share the same sense in a well-written discourse
• One sense per collocation
– With a high probability an ambiguous word has only one sense in a given collocation
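One way the “one sense per discourse” claim could be put to use is as a post-processing step that relabels every occurrence of a word with its majority sense in the document. The sketch below assumes some earlier tagger has already produced (word, sense) pairs; the function name and the sense labels are hypothetical.

```python
from collections import Counter


def apply_one_sense_per_discourse(tagged_tokens):
    """Relabel every occurrence of a word with its majority sense in the document.

    `tagged_tokens` is a list of (word, sense) pairs from some earlier tagger;
    sense may be None where that tagger abstained.
    """
    # Count, per word, how often each sense was assigned in this document.
    votes = {}
    for word, sense in tagged_tokens:
        if sense is not None:
            votes.setdefault(word, Counter())[sense] += 1

    # Overwrite each tag with the word's most frequent sense, if it has one.
    result = []
    for word, sense in tagged_tokens:
        if word in votes:
            sense = votes[word].most_common(1)[0][0]
        result.append((word, sense))
    return result


doc = [
    ("plant", "plant/factory"),
    ("plant", None),
    ("plant", "plant/factory"),
    ("plant", "plant/flora"),
]
print(apply_one_sense_per_discourse(doc))
```

Because the claim describes a tendency rather than a law, a real system would probably apply such a step only when the majority is clear.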
Taxonomy of WSD Algorithms
• Knowledge based• Corpus based
– Tagged corpora– Untagged corpora
• Hybrid approaches
Word Senses and Lexicons
Sense tagging = attaching senses from some lexicon to words in text
Sense-enumerative dictionary
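A sense-enumerative dictionary lists the senses of each headword, and sense tagging attaches one of those listed senses to a token. A minimal sketch of that relationship follows; the sense identifiers, the `TaggedToken` class and the `tag` helper are hypothetical, and the glosses for “ball” are only the two readings mentioned in these slides.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sense-enumerative lexicon: headword -> list of (sense id, gloss).
LEXICON = {
    "ball": [
        ("ball.1", "a rounded object"),
        ("ball.2", "a dance"),
    ],
}


@dataclass
class TaggedToken:
    form: str
    sense_id: Optional[str] = None  # None means not (yet) tagged


def tag(token, sense_id):
    """Return a copy of `token` tagged with `sense_id`, if the lexicon lists it."""
    known = {sid for sid, _ in LEXICON.get(token.form, [])}
    if sense_id not in known:
        raise ValueError(f"{sense_id!r} is not a listed sense of {token.form!r}")
    return TaggedToken(token.form, sense_id)


print(tag(TaggedToken("ball"), "ball.2"))
```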
Deficiencies of dictionaries
• Omissions and oversights
• Coverage of names
• Ghost words – Dord = density (D or d)
• Differentiating senses (P. Hanks: A serious problem for computer applications is that dictionaries compiled for human users focus on giving lists of meanings for each entry, without saying much about how one meaning may be distinguished from another in text)
Two levels of sense distinction
• Homography
– Two senses of a word are homographic when there is no obvious semantic relation between them (e.g. a ball – a dance or a rounded object)
– Risk of amateur etymology
• Polysemy
Distinguishing senses
• P.Hanks: No generally agreed criteria exist for what counts as a sense, or for how to distinguish one sense from another
• Zeugma: Arthur and his driving license expired last Thursday.
• Polysemy vs. vagueness (e.g. mountain)
The Bank Model
• Assumption A – Words have a finite set of clearly distinct, well-defined senses
• Assumption B – Native speakers of … know instantly and effortlessly which meaning applies in a given situation
• Criticism of the bank model: Kilgarriff (“I don’t believe in word senses”), Pustejovsky (Generative lexicon), and many others…
NLP Lexicons
• Longman Dictionary of Contemporary English (LDOCE) – three-level embedded structure for sense distinctions (homographs, senses, optional subsenses)
• Roget’s Thesaurus
• Cambridge International Dictionary of English
• COBUILD English Language Dictionary
• WordNet
Thesaurus
Ontology
Ontology
• There is little agreement on what an ontology is… In general, an ontology can be described as an inventory of the objects, processes, etc. in a domain, as well as a specification of (some of) the relations that hold among them.
• Aristotle: genus (the category to which something belongs) and differentiae (properties that uniquely distinguish the category members from their parent and from one another)
• Nodes (concepts) in the hierarchy related by subsumption
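Subsumption in such a hierarchy can be checked by walking from a concept up through its parents. The sketch below assumes a simple tree in which each concept has at most one parent; the concepts and the function names are invented for illustration.

```python
# Toy is-a hierarchy (invented for illustration): child -> parent.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def ancestors(concept):
    """Yield every concept above `concept` in the hierarchy, nearest first."""
    while concept in PARENT:
        concept = PARENT[concept]
        yield concept


def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    return general == specific or general in ancestors(specific)


print(subsumes("animal", "dog"))
print(subsumes("dog", "animal"))
```

Real ontologies often allow several parents per node, in which case the walk becomes a graph traversal rather than a single chain.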
Ontologies in different traditions
• Philosophical
• Cognitive
• Artificial intelligence
• Lexical semantics
• Lexicography
• Information science
Princeton WordNet
• Lexical semantic network structured around the notion of synsets
• Synset – a group of literals of the same part of speech that are mutually interchangeable in a given context (“set of synonyms”)
• http://www.cogsci.princeton.edu/~wn/w3wn.html
• Inspired by psycholinguistic theories of human lexical memory
• Broad coverage, rich lexical information, freely available
• Too fine-grained for practical NLP tasks
• Relations between two synsets: hyponymy, hyperonymy, meronymy …
EuroWordNet (i)
• Multilingual database containing several monolingual wordnets structured along the same lines as the Princeton WordNet 1.5
• English, Dutch, German, Spanish, French, Italian, Czech, Estonian
• Inter-Lingual-Index
• http://www.hum.uva.nl/~ewn
EuroWordNet (ii)
[Figure: a Princeton WordNet 1.5 synset linked to the corresponding EuroWordNet synsets]
Princeton WordNet 1.5: note, observe, make a remark, remark
EuroWordNet (Czech): prohodit, poznamenat, připomenout
EuroWordNet (German): anmerken, bemerken
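The role of the Inter-Lingual-Index can be sketched as a shared key that links synsets across the monolingual wordnets. In the sketch below the record identifier `ili-0001`, the dictionary layout and the `translate_synset` helper are hypothetical; only the literals come from the example above.

```python
# Each language-specific synset points to a language-neutral index record.
# The identifier "ili-0001" is made up for this sketch.
WORDNETS = {
    "en": {"ili-0001": ["note", "observe", "make a remark", "remark"]},
    "cs": {"ili-0001": ["prohodit", "poznamenat", "připomenout"]},
    "de": {"ili-0001": ["anmerken", "bemerken"]},
}


def translate_synset(literal, source, target):
    """Return the target-language literals linked to `literal` via the index."""
    for ili_id, literals in WORDNETS[source].items():
        if literal in literals:
            return WORDNETS[target].get(ili_id, [])
    return []


print(translate_synset("remark", "en", "de"))
```

Routing every link through one index means each wordnet needs mappings only to the index, not to every other language.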
Sense tagged corpora
• “interest” corpus
– 2k sentences containing the word “interest”
• SENSEVAL
– http://www.senseval.org
– WSD evaluation exercise, first run in 1998
• SEMCOR
– http://multisemcor.itc.it/semcor.php
– Subset of the English Brown corpus, 700k words
– More than 200k words sense-tagged according to Princeton WordNet 1.6
Final remarks
• Similarity of POS- and sense tagging
• Mapping lexical resources