Annotating the WordNet Glosses
Ben Haskell <ben@clarity.princeton.edu>
2004/10/08
Annotating the Glosses
• Annotating open-class words with their WordNet sense tag (a.k.a. sense-tagging)
• A disambiguation task: the process of linking an instance of a word to the WordNet synset representing its context-appropriate meaning, e.g. run a company vs. run an errand (see the sketch below)
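As a minimal illustration of how many senses a tagger must choose among, here is a sketch using NLTK's WordNet interface (NLTK is not part of the project described in these slides):

    from nltk.corpus import wordnet as wn

    # 'run' has dozens of verb senses in WordNet; sense-tagging picks
    # the synset that fits the context ('run a company' vs. 'run an errand').
    for synset in wn.synsets('run', pos=wn.VERB)[:5]:
        print(synset.name(), '-', synset.definition())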
[Diagram: the two senses situated in the WordNet verb hierarchy. { run#29 }, v (carry out; "run an errand") and { run#12, operate }, v (direct or control; projects, businesses, etc.; "She is running a relief operation in the Sudan") sit at different points among synsets such as { control, command }, { change, alter, modify }, { end, terminate }, { complete, finish }, { carry_through, accomplish, execute, carry_out, action, fulfil, fulfill }, { make, create }, { cause, do, make }, { effect, effectuate, bring_about, set_up }, { manage, deal, care, handle }, and { direct }]
Glosses as node points in the network of relations
• Once a word’s gloss is annotated, the synsets for all conceptually-related words used in the gloss can be accessed via their sense tags
• Situates the word in an expanded network of links to other semantically related words/concepts in WordNet (a sketch follows below)
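A sketch of this traversal via NLTK (not the project's tooling; sense numbers follow WordNet 3.0 and may differ in other versions). The links printed are the ones in the dance#2 diagram that follows:

    from nltk.corpus import wordnet as wn

    dance = wn.synset('dance.v.02')          # { dance#2 }, v
    print(dance.definition())

    # Once 'graceful' in the gloss is sense-tagged, the tag leads into
    # graceful#1's own links: its antonym and its similar adjectives.
    graceful = wn.synset('graceful.a.01')
    print(graceful.lemmas()[0].antonyms())   # awkward
    print(graceful.similar_tos())            # e.g. fluent, fluid, ...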
[Diagram, built up over four slides: { dance#2 }, v, glossed "move in a graceful and rhythmical way", is linked IS-A to { move }, ENTAIL to { step }, and DERIV to dancer#1 (social_dancer) and dancer#2 (professional_dancer). Sense tags on the gloss words then extend the network outward: { graceful#1 }, a, with ANT awkward and SIM deft, elegant, liquid, fluent, fluid, gainly, ...; { rhythmical#1 }, a, with ANT unrhythmical and SIM beating, pulsating, pulsing, cadenced, cadent, danceable, ...; and { way#8 }, n (manner, mode, style, fashion)]
Annotating the Glosses
• Automatically tag monosemous words/collocations
• For gold standard quality, sense-tagging of polysemous words must be done manually
• More accurate sense-tagged data means better results for WSD systems, which means better performance from applications that depend on WSD
System overview
• Preprocessor
  – Gloss "parser" and tokenizer/lemmatizer
  – Semantic class recognizer
  – Noun phrase chunker
  – Collocation recognizer (globber)
• Automatic sense tagger for monosemous terms
• Manual tagging interface
Logical structure of a Gloss
• Smallest unit is a word, contracted form, or non-lexical punctuation
• Collocations are decomposed into their constituent parts (a sketch follows below)
  – Allows coding of discontinuous collocations
  – A collocation can be treated either as a single unit or as a sequence of forms
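A hypothetical sketch of such a representation (the class and field names here are invented for illustration, not the project's actual format); the gloss is the allegro example that appears later in the deck:

    from dataclasses import dataclass

    @dataclass
    class WordForm:
        text: str
        coll_ids: tuple = ()   # ids of collocations this form belongs to

    # "a musical composition or passage performed quickly":
    # 'musical' belongs to both collocations a and b.
    gloss = [
        WordForm('a'),
        WordForm('musical', ('a', 'b')),
        WordForm('composition', ('a',)),
        WordForm('or'),
        WordForm('passage', ('b',)),
        WordForm('performed'),
        WordForm('quickly'),
    ]

    def collocation(gloss, coll_id):
        """Reassemble a (possibly discontinuous) collocation as one unit."""
        return '_'.join(wf.text for wf in gloss if coll_id in wf.coll_ids)

    print(collocation(gloss, 'b'))   # musical_passage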
Example glosses
• n. pass, toss, flip: (sports) the act of throwing the ball to another member of your team; "the pass was fumbled"
• n. brace, suspender: elastic straps that hold trousers up (usually used in the plural)
• v. kick: drive or propel with the foot
[Diagram: the parts of a gloss: optional info preceding the def (domain category, etc.), the def itself, optional info following the def (usage info, etc.), and zero or more examples (ex*)]
[Diagram: the tokenized def of { allegro#2 }, n: [a] [musical] [composition] [or] [passage] [performed] [quickly]. musical and composition form collocation a (sk=musical_composition%1:10:00::); musical and passage form the discontinuous collocation b (sk=musical_passage%1:10:00::); performed and quickly carry sk=perform%2:36:01:: and sk=quickly%4:02:00::]
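The sk= values are standard WordNet sense keys (lemma%ss_type:lex_filenum:lex_id:head:head_id). As a sketch, NLTK can resolve one back to its lemma and synset, assuming the key from the diagram is still valid in the installed WordNet version:

    from nltk.corpus import wordnet as wn

    # Resolve a sense key (taken from the diagram above); this raises
    # WordNetError if the key does not exist in this WordNet version.
    lemma = wn.lemma_from_key('perform%2:36:01::')
    print(lemma.synset(), '-', lemma.synset().definition())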
Gloss “parser”
• Regularization & clean-up of the gloss
• Recognize & XML-tag <def>, <aux>, <ex>, <qf>, verb arguments, domain <classif>
• <aux> and <classif> contents do not get tagged
• Replace XML-unfriendly characters (&, <, >) with XML entities (see the sketch below)
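Python's standard library performs exactly this substitution; a stand-in sketch (the slides do not specify the original tooling):

    from xml.sax.saxutils import escape

    # escape() replaces the XML-unfriendly characters &, <, > with
    # the entities &amp;, &lt;, &gt;.
    print(escape('bed & breakfast, x < y'))
    # bed &amp; breakfast, x &lt; y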
Tokenizer
• Isolate word forms
• Differentiate non-lexical from lexical punctuation (a toy heuristic follows below)
  – E.g., sentence-ending periods vs. periods in abbreviations
• Recognize apostrophes vs. quotation marks
  – E.g., states' rights vs. 'college-bound students'
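A toy heuristic for the period distinction (invented here; the project's tokenizer is more elaborate):

    import re

    # A period is lexical (part of the token) after an abbreviation
    # pattern; otherwise treat it as non-lexical sentence punctuation.
    ABBREV = re.compile(r'^(?:[A-Za-z]\.)+$|^etc\.$')

    def classify_period(token):
        return 'lexical' if ABBREV.match(token) else 'non-lexical'

    print(classify_period('e.g.'))  # lexical
    print(classify_period('dog.'))  # non-lexical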
Lemmatizer
• A lemma is the WordNet entry form plus WordNet part of speech
• Inflected forms are reduced to their base forms by a stemmer developed in-house specifically for this task
• A <wf> may be assigned multiple potential lemmas (see the sketch below)
  – saw: lemma="saw%1|saw%2|see%2"
  – feeling: lemma="feeling%1|feel%2"
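WordNet's own morphological analysis shows why one form yields several candidate lemmas; a sketch via NLTK (the project used its in-house stemmer instead):

    from nltk.corpus import wordnet as wn

    # _morphy is NLTK's internal WordNet lemmatization helper; it
    # returns every base form valid for the given part of speech.
    def candidate_lemmas(form):
        return {(base, pos)
                for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV)
                for base in wn._morphy(form, pos)}

    print(candidate_lemmas('saw'))      # ('saw','n'), ('saw','v'), ('see','v')
    print(candidate_lemmas('feeling'))  # ('feeling','n'), ('feel','v')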
Lemmatizer, cont.
• Exceptions: stopwords/phrases
  – Closed-class words (prepositions, pronouns, conjunctions, etc.)
  – Multi-word terms such as "by means of", "according to", "granted that"
• Hyphenated terms not in WordNet get split and separately lemmatized
  – E.g., over-fed becomes over + fed
Semantic class recognizer
• Recognizes and marks up parenthesized and free text belonging to a finite set of semantic classes
• chem(ical symbol), curr(ency), date, d(ate)range, math, meas(ure phrase), n(umeric)range, num(ber), punc(tuation), symb(olic text), time, year
• Words and phrases in these classes will not be sense-tagged (a toy recognizer follows below)
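A toy recognizer for a few of the classes (the patterns are invented here for illustration):

    import re

    # Order matters: the first matching class wins.
    CLASSES = [
        ('year',   re.compile(r'^\d{4}$')),
        ('nrange', re.compile(r'^\d+-\d+$')),
        ('num',    re.compile(r'^\d+(\.\d+)?$')),
    ]

    def semantic_class(token):
        for name, pattern in CLASSES:
            if pattern.match(token):
                return name   # marked up, and excluded from sense-tagging
        return None

    print(semantic_class('1865'))   # year
    print(semantic_class('10-20'))  # nrange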
Noun Phrase chunker
• Isolates noun phrases (“chunks”) in order to narrow the scope for finding noun collocations in the next stage
• Glosses are not otherwise syntactically parsed
• Trained a model and tagged parts of speech using Thorsten Brants's TnT statistical tagger
Noun Phrase chunker, cont.
• Trained and chunked noun phrases using Steven Abney’s partial parser Cass
• Enabled automatic recognition of otherwise ambiguous noun compounds and fixed expressions (a chunker sketch follows below)
  – E.g., opening move (JJ NN vs. VBG NN vs. VBG VB vs. NN VB), bill of fare (NN IN NN vs. VB IN NN)
• Increased noun collocation coverage by 25% (types) and 29% (tokens)
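Not the TnT/Cass pipeline itself, but a comparable NP chunker can be sketched with NLTK's regexp chunker over POS tags:

    import nltk

    # One chunk rule: an optional determiner, any adjectives, then nouns.
    chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')

    tagged = [('the', 'DT'), ('opening', 'JJ'), ('move', 'NN'),
              ('surprised', 'VBD'), ('everyone', 'NN')]
    print(chunker.parse(tagged))
    # (S (NP the/DT opening/JJ move/NN) surprised/VBD (NP everyone/NN))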
Collocation recognizer
• Bag-of-words approach (a sketch of the pre-filter follows this list)
  – To find 'North_America', find glosses that have both 'North' and 'America'
• Four passes
  1. Ghost: 'bring_home_the_bacon'
     • Mark 'bacon' so it won't be tagged as monosemous
  2. Contiguous: 'North_America'
  3. Disjoint: North (and) [(South) America]
  4. Examples: tag the synset's collocations in its own gloss
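A minimal sketch of the bag-of-words pre-filter (the data and names are invented for illustration):

    def candidate_glosses(collocation, glosses):
        # Only glosses containing every constituent word are candidates
        # for the contiguous and disjoint passes.
        parts = set(collocation.lower().split('_'))
        for gloss_id, words in glosses.items():
            if parts <= {w.lower() for w in words}:
                yield gloss_id

    glosses = {
        1: ['a', 'river', 'in', 'North', 'America'],
        2: ['North', 'and', 'South', 'America'],  # disjoint match
        3: ['a', 'city', 'in', 'Europe'],
    }
    print(list(candidate_glosses('North_America', glosses)))  # [1, 2]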
Automatic sense-tagger
• Tag monosemous words.
• Words that have…
  – …only one lemmatized form
  – …only one WordNet sense
  – …not been marked as possibly ambiguous
• i.e., non-wait-list words, non-'bacon' words (a sketch follows below)
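A sketch of the monosemy test via NLTK (the ghosted set stands in for the wait-list/'bacon' bookkeeping described above):

    from nltk.corpus import wordnet as wn

    def is_monosemous(lemma, pos, ghosted=frozenset()):
        # Auto-taggable only if not ghosted and exactly one sense remains.
        if lemma in ghosted:
            return False
        return len(wn.synsets(lemma, pos=pos)) == 1

    print(is_monosemous('oxygen', wn.NOUN))  # True: a single noun sense
    print(is_monosemous('run', wn.VERB))     # False: dozens of verb senses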
The mantag interface
• Simplicity
  – Taggers will repeat the same actions hundreds of times per day
• Automation
  – Instead of typing the 148,000 search terms, use a centralized list
  – Also allows for easy tracking of the double-checking process
Statistics
Total number of glosses             117,549
Total number of words (tokens)    1,221,341
Total taggable words (tokens)       658,958 (57.9%)
  auto-tagged                        86,914  13.2%
  mono sense/pos                      3,872   0.6%
  poly sense and/or pos             567,944  86.2%
  not in WN                             228  ~0.0%
Statistics, cont.
Initial taggable collocations (tokens)  49,726
  auto-tagged                           41,475  83.4%
  mono sense/pos                           462   0.9%
  poly sense and/or pos                  6,888  13.8%
  not in WN                                  0   0.0%
Statistics, cont.
Total taggable word types  61,811
  auto-tagged              19,117  30.9%
  mono sense/pos              760   1.2%
  poly sense and/or pos    41,650  67.4%
  words not in WN             127   0.2%
  non-word forms               30  ~0.0%
Statistics, cont.
Done thus far…
  automatic tags          130,770
  automatic collocations   49,726
  manual tags              42,020
  manual collocations       2,961
Aim of ISI Effort
• Jerry Hobbs, Ulf Hermjakob, Nishit Rathod, Fahad al-Qahtani
• Gold standard translation of glosses into first-order logic with reified events
ISI Effort examples

In: move in a graceful and rhythmical way
(the tokens a and and are marked ignore; the rest carry the sense tags move#v#2, graceful#a#1, rhythmic#a#1, way#n#8)

Out: gloss for dance, v, 2:
dance-V-2'(e0,x) -> move-V-2'(e1,x) & in'(e2,e1,y) & graceful-A-1'(e3,y) & rhythmic-A-1'(e4,y) & way-N-8'(e5,y)
ISI Effort examples, cont.

In: a musical composition or passage performed quickly
(the tokens a and or are marked ignore; the rest carry the sense tags musical_composition#n#1, musical_passage#n#1, perform#v#2, quickly#r#4)

Out: gloss for allegro, n, 2:
allegro-N-2'(e0,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x) & perform-V-2'(e2,y,x) & quick-D-4'(e3,e2)
musical_composition-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)
musical_passage-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)
ISI Method
• Identify the most common gloss patterns and convert them first
• Parse
  – Using Charniak's parser: uneven, sometimes bizarre results ("aspen": VBN)
  – Using Hermjakob's CONTEX parser: greater local control
ISI Progress
• Completed glosses of nouns with these patterns:
  – NG (P NG)*: 45% of nouns
  – NG ((VBN | VING) NG): 15% of nouns
• 45 + 15 = 60% complete!
• But gloss patterns are in a Zipf distribution:
Distribution of noun glosses:

  NP (NP,PP)       7,181  41%
  NP (NP,SBAR)     2,978  17%
  NP (NP,VP)       2,684  15%
  NP (NP,PP,SBAR)    363   2%
  NP (NP,CC,NP)      280   2%
  NP (DT,JJ,NN)      272   2%