Annotating the WordNet Glosses Ben Haskell <[email protected]>

2004/10/08

Annotating the Glosses

• Annotating open-class words with their WordNet sense tag (a.k.a. sense-tagging)

• A disambiguation task: Process of linking an instance of a word to the WordNet synset representing its context-appropriate meaning, e.g.

run a company vs. run an errand

Two senses of the verb run, each linked to its own synset:

. . . run a company . . . -> { run#12, operate }, v: direct or control; projects, businesses, etc.; “She is running a relief operation in the Sudan”

. . . run an errand . . . -> { run#29 }, v: carry out; “run an errand”

Other synsets shown in the diagram:

{ control, command }
{ change, alter, modify }
{ end, terminate }
{ complete, finish }
{ carry_through, accomplish, execute, carry_out, action, fulfil, fulfill }
{ make, create }
{ cause, do, make }
{ effect, effectuate, bring_about, set_up }
{ manage, deal, care, handle }
{ direct }

Glosses as node points in the network of relations

• Once a word’s gloss is annotated, the synsets for all conceptually-related words used in the gloss can be accessed via their sense tags

• Situates the word in an expanded network of links to other semantically-related words/concepts in WordNet
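
The idea of glosses as node points can be sketched as a graph walk. This is a minimal illustration, assuming a hypothetical `GLOSS_TAGS` table that maps each synset to the sense tags found in its annotated gloss; the entries are typed in by hand from the dance#2 gloss ("move in a graceful and rhythmical way") and are not read from WordNet.

```python
from collections import deque

# Hypothetical toy table: synset -> sense tags found in its annotated gloss.
GLOSS_TAGS = {
    "dance#v#2": ["move#v#2", "graceful#a#1", "rhythmical#a#1", "way#n#8"],
    "move#v#2": [],
    "graceful#a#1": [],
    "rhythmical#a#1": [],
    "way#n#8": [],
}

def reachable_via_glosses(start, max_depth=1):
    """Breadth-first walk over gloss links, at most max_depth hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        synset, depth = queue.popleft()
        if depth == max_depth:
            continue
        for tag in GLOSS_TAGS.get(synset, []):
            if tag not in seen:
                seen.add(tag)
                queue.append((tag, depth + 1))
    seen.discard(start)
    return sorted(seen)
```

Calling `reachable_via_glosses("dance#v#2")` returns the four synsets named in the gloss; raising `max_depth` would follow the glosses of those synsets in turn.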

{ dance#2 }, v: move in a graceful and rhythmical way;

Relations from { dance#2 }:

  IS-A and ENTAIL links to { step } and { move }
  DERIV: { dancer#1, social_dancer }
  DERIV: { dancer#2, professional_dancer }

Synsets reached through the sense-tagged words of the gloss:

  { graceful#1 }, a
    ANT: awkward
    SIM: deft
    SIM: elegant
    SIM: fluent, fluid, liquid
    SIM: gainly
    . . .

  { rhythmical#1 }, a
    ANT: unrhythmical
    SIM: beating, pulsating, pulsing
    SIM: cadenced, cadent
    SIM: danceable
    . . .

  { way#8 }, n, with manner, mode, style, fashion

Annotating the Glosses

• Automatically tag monosemous words/collocations

• For gold standard quality, sense-tagging of polysemous words must be done manually

• More accurate sense-tagged data means better results for WSD systems, which means better performance from applications that depend on WSD

System overview

• Preprocessor
  – Gloss “parser” and tokenizer/lemmatizer
  – Semantic class recognizer
  – Noun phrase chunker
  – Collocation recognizer (globber)

• Automatic sense tagger for monosemous terms

• Manual tagging interface

Logical structure of a Gloss

• Smallest unit is a word, contracted form, or non-lexical punctuation

• Collocations are decomposed into their constituent parts
  – Allows coding of discontinuous collocations
  – A collocation can be treated either as a single unit or a sequence of forms

Example glosses

• n. pass, toss, flip: (sports) the act of throwing the ball to another member of your team; "the pass was fumbled"

• n. brace, suspender: elastic straps that hold trousers up (usually used in the plural)

• v. kick: drive or propel with the foot

Structure of a gloss, in order:

1. Optional info preceding def: domain category, etc.
2. def
3. Optional info following def: usage info, etc.
4. ex* (zero or more examples)

{ allegro#2 }, n

def: [a] [musical] [composition] [or] [passage] [performed] [quickly] . . .

  [musical]      coll=a coll=b
  [composition]  coll=a, sk=musical_composition%1:10:00::
  [passage]      coll=b, sk=musical_passage%1:10:00::
  [performed]    sk=perform%2:36:01::
  [quickly]      sk=quickly%4:02:00::
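
The way `coll` labels tie the parts of a discontinuous collocation together can be sketched as follows. This is an illustration only: the token layout is hypothetical, and it assumes (my reading of the slide) that "musical" carries both `coll=a` and `coll=b` because it is shared by the two collocations.

```python
from collections import defaultdict

# Hypothetical token layout: (word form, collocation ids the form belongs to).
TOKENS = [
    ("a", []),
    ("musical", ["a", "b"]),   # assumed to be shared by both collocations
    ("composition", ["a"]),
    ("or", []),
    ("passage", ["b"]),
    ("performed", []),
    ("quickly", []),
]

def collocations(tokens):
    """Group forms by collocation id and join each group with underscores."""
    groups = defaultdict(list)
    for form, coll_ids in tokens:
        for coll_id in coll_ids:
            groups[coll_id].append(form)
    return {coll_id: "_".join(forms) for coll_id, forms in sorted(groups.items())}
```

With the tokens above, `collocations(TOKENS)` rebuilds `musical_composition` for id `a` and `musical_passage` for id `b`, even though "musical" and "passage" are not adjacent in the gloss.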

Gloss “parser”

• Regularization & clean-up of the gloss

• Recognize & XML tag <def>, <aux>, <ex>, <qf>, verb arguments, domain <classif>

• <aux> and <classif> contents do not get tagged

• Replace XML-unfriendly characters (&, <, >) with XML entities
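
A rough sketch of the last two steps, assuming a much simpler gloss format than the real one: quoted segments separated by semicolons are treated as examples and everything else as the definition. The function names are made up for illustration, and `<aux>`, `<qf>` and `<classif>` are not handled. Escaping uses `xml.sax.saxutils.escape` from the Python standard library, which replaces `&`, `<` and `>`.

```python
from xml.sax.saxutils import escape

def split_gloss(gloss):
    """Naive split: quoted segments are examples, the rest is the definition."""
    parts = [part.strip() for part in gloss.split(";")]
    examples = [part.strip('"') for part in parts if part.startswith('"')]
    definition = "; ".join(part for part in parts if not part.startswith('"'))
    return definition, examples

def to_xml(gloss):
    """Wrap the pieces in <def>/<ex> tags, escaping &, < and >."""
    definition, examples = split_gloss(gloss)
    xml = "<def>" + escape(definition) + "</def>"
    for example in examples:
        xml += "<ex>" + escape(example) + "</ex>"
    return xml
```

A semicolon inside an example would break this split, which is one reason a real gloss "parser" needs more than a one-line rule.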

Tokenizer

• Isolate word forms

• Differentiate non-lexical from lexical punctuation
  – E.g., sentence-ending periods vs. periods in abbreviations

• Recognize apostrophe vs. quotation marks
  – E.g., states’ rights vs. `college-bound students’
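
The period case can be illustrated with a small sketch. It assumes a hand-written abbreviation list (`ABBREVIATIONS` is hypothetical, not the project's list) and peels trailing punctuation off every other chunk. The apostrophe versus quotation mark distinction is not attempted here.

```python
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}  # illustrative list only
TRAILING_PUNCTUATION = ".,;:!?"

def tokenize(text):
    """Split on whitespace, then detach non-lexical trailing punctuation."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # A period that is part of a known abbreviation is lexical: keep it.
        while chunk and chunk not in ABBREVIATIONS and chunk[-1] in TRAILING_PUNCTUATION:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens
```

Here the final period of "foot." becomes its own token, while "etc." stays whole.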

Lemmatizer

• A lemma is the WordNet entry form plus WordNet part of speech

• Inflected forms are uninflected using a stemmer developed in-house specifically for this task

• A <wf> may be assigned multiple potential lemmas
  – saw: lemma=“saw%1|saw%2|see%2”
  – feeling: lemma=“feeling%1|feel%2”
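
How a single form ends up with several potential lemmas can be sketched with toy lookup tables. `ENTRIES` and `BASE_FORMS` are hypothetical stand-ins for WordNet and for the in-house stemmer, and hold only what the two examples need. The numeric part-of-speech codes follow WordNet's convention of 1 for nouns and 2 for verbs.

```python
# Hypothetical stand-in for WordNet: (entry form, part-of-speech code).
ENTRIES = {("saw", 1), ("saw", 2), ("see", 2), ("feeling", 1), ("feel", 2)}

# Hypothetical stand-in for the stemmer: inflected form -> possible base forms.
BASE_FORMS = {"saw": ["see"], "feeling": ["feel"]}

def potential_lemmas(form):
    """Return every lemma%pos the form could stand for, joined with '|'."""
    candidates = [form] + BASE_FORMS.get(form, [])
    lemmas = [
        f"{word}%{pos}"
        for word in candidates
        for pos in (1, 2)
        if (word, pos) in ENTRIES
    ]
    return "|".join(lemmas)
```

The form is tried as an entry in its own right first, then each base form the stemmer proposes, which reproduces the two lemma strings shown on the slide.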

Lemmatizer, cont.

• Exceptions: stopwords/phrases
  – Closed-class words (prepositions, pronouns, conjunctions, etc.)
  – Multi-word terms such as “by means of”, “according to”, “granted that”

• Hyphenated terms not in WordNet get split and separately lemmatized
  – E.g., over-fed becomes over + fed
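
Both exceptions are simple enough to sketch. The stop phrases are the three named on the slide; `IN_WORDNET` is a hypothetical one-entry stand-in for a real WordNet lookup, and the function names are invented for illustration.

```python
STOP_PHRASES = {"by means of", "according to", "granted that"}  # from the slide
IN_WORDNET = {"well-known"}  # hypothetical stand-in for a WordNet lookup

def is_stop_phrase(words):
    """True if the word sequence is one of the multi-word stop phrases."""
    return " ".join(words).lower() in STOP_PHRASES

def split_hyphenated(form, in_wordnet=IN_WORDNET):
    """Split a hyphenated form unless the lookup knows it as a single entry."""
    if "-" in form and form not in in_wordnet:
        return form.split("-")
    return [form]
```

Each part returned by `split_hyphenated` would then go through lemmatization on its own.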

Semantic class recognizer

• Recognizes and marks up parenthesized and free text belonging to a finite set of semantic classes

• chem(ical symbol), curr(ency), date, d(ate)range, math, meas(ure phrase), n(umeric)range, num(ber), punc(tuation), symb(olic text), time, year

• Words and phrases in these classes will not be sense-tagged
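
A recognizer for a few of these classes could look like the sketch below. The regular expressions are my own guesses at reasonable patterns for four of the classes (curr, year, nrange, num), not the project's rules, and they classify one token at a time rather than marking up free text.

```python
import re

# Order matters: a four-digit year must be tried before the generic number.
CLASS_PATTERNS = [
    ("curr", re.compile(r"^\$\d+(\.\d\d)?$")),
    ("year", re.compile(r"^(1\d{3}|20\d{2})$")),
    ("nrange", re.compile(r"^\d+-\d+$")),
    ("num", re.compile(r"^\d+(\.\d+)?$")),
]

def semantic_class(token):
    """Return the first matching class name, or None for ordinary words."""
    for name, pattern in CLASS_PATTERNS:
        if pattern.match(token):
            return name
    return None
```

Tokens that get a class name back would be excluded from sense-tagging, while tokens that return `None` continue through the pipeline.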

Noun Phrase chunker

• Isolates noun phrases (“chunks”) in order to narrow the scope for finding noun collocations in the next stage

• Glosses are not otherwise syntactically parsed

• Trained and tagged POS using Thorsten Brants’s TnT statistical tagger

Noun Phrase chunker, cont.

• Trained and chunked noun phrases using Steven Abney’s partial parser Cass

• Enabled automatic recognition of otherwise ambiguous noun compounds and fixed expressions
  – E.g., opening move (JJ NN vs. VBG NN vs. VBG VB vs. NN VB), bill of fare (NN IN NN vs. VB IN NN)

• Effected an increase in noun collocation coverage by 25% (types) and 29% (tokens)

Collocation recognizer

• Bag of Words approach
  – To find ‘North_America’, find glosses that have both ‘North’ and ‘America’

• Four passes
  1. Ghost: ‘bring_home_the_bacon’
     • mark ‘bacon’ so it won’t be tagged as monosemous
  2. Contiguous: ‘North_America’
  3. Disjoint: North (and) [(South) America]
  4. Examples: tag the synset’s collocations in its gloss
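
The bag-of-words filter and the contiguous pass can be sketched as below. This is a simplified illustration on whitespace-split strings with invented function names; the real recognizer works on tokenized, chunked glosses, and the ghost, disjoint and example passes are not shown.

```python
def candidate_glosses(collocation, glosses):
    """Bag-of-words filter: keep glosses that contain every part of the collocation."""
    parts = set(collocation.split("_"))
    return [gloss for gloss in glosses if parts <= set(gloss.split())]

def contiguous_matches(collocation, gloss):
    """Return start positions where the parts appear next to each other, in order."""
    parts = collocation.split("_")
    words = gloss.split()
    width = len(parts)
    return [
        start
        for start in range(len(words) - width + 1)
        if words[start:start + width] == parts
    ]
```

A gloss that passes the filter but has no contiguous match, such as "North and South America", is the kind of case left for the disjoint pass.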

Automatic sense-tagger

• Tag monosemous words.

• Words that have…
  – …only one lemmatized form
  – …only one WordNet sense
  – …not been marked as possibly ambiguous
    • i.e. non wait-list words, non ‘bacon’ words
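
The three conditions translate directly into a small check. In this sketch the sense counts and the flagged set are passed in as plain Python values with made-up contents; `auto_tag` is a hypothetical name, not the project's function.

```python
def auto_tag(lemmas, sense_counts, flagged=frozenset()):
    """Return the lemma to tag automatically, or None if any condition fails."""
    if len(lemmas) != 1:                 # more than one lemmatized form
        return None
    lemma = lemmas[0]
    if lemma in flagged:                 # wait-list or 'bacon' word
        return None
    if sense_counts.get(lemma) != 1:     # polysemous or unknown
        return None
    return lemma
```

Anything that returns `None` here is left for the manual tagging interface.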

The mantag interface

• Simplicity
  – Taggers will repeat the same actions hundreds of times per day

• Automation
  – Instead of typing the 148,000 search terms, use a centralized list
  – Also allows for easy tracking of double-checking process

Statistics

Total number of glosses              117,549
Total number of words (tokens)     1,221,341
Total taggable words (tokens)        658,958   (57.9%)
  auto-tagged                         86,914    13.2%
  mono sense/pos                       3,872     0.6%
  poly sense and/or pos              567,944    86.2%
  not in WN                              228    ~0.0%
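
The breakdown percentages are shares of the taggable tokens, which a few lines of arithmetic make explicit. The counts are copied from the table above; the variable and function names are mine.

```python
TAGGABLE_TOKENS = 658_958

BREAKDOWN = {
    "auto-tagged": 86_914,
    "mono sense/pos": 3_872,
    "poly sense and/or pos": 567_944,
    "not in WN": 228,
}

def shares(breakdown, total):
    """Percentage of the total for each category, rounded to one decimal."""
    return {name: round(100 * count / total, 1) for name, count in breakdown.items()}
```

The four categories add up to the taggable total, so the shares cover every taggable token.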

Statistics, cont.

Initial taggable collocations (tokens)   49,726
  auto-tagged                            41,475   83.4%
  mono sense/pos                            462    0.9%
  poly sense and/or pos                   6,888   13.8%
  not in WN                                   0    0.0%

Statistics, cont.

Total taggable word types   61,811
  auto-tagged               19,117   30.9%
  mono sense/pos               760    1.2%
  poly sense and/or pos     41,650   67.4%
  words not in WN              127    0.2%
  non-word forms                30   ~0.0%

Statistics, cont.

Done thus far…

automatic tags           130,770
automatic collocations    49,726
manual tags               42,020
manual collocations        2,961

Aim of ISI Effort

• Jerry Hobbs, Ulf Hermjakob, Nishit Rathod, Fahad al-Qahtani

• Gold standard translation of glosses into first-order logic with reified events

ISI Effort examples

In:

move in a graceful and rhythmic way

  move      move#v#2
  in        ignore
  a         ignore
  graceful  graceful#a#1
  and       ignore
  rhythmic  rhythmic#a#1
  way       way#n#8

Out:

gloss for dance, v, 2:

dance-V-2'(e0,x) -> move-V-2'(e1,x) & in'(e2,e1,y) & graceful-A-1'(e3,y) & rhythmic-A-1'(e4,y) & way-N-8'(e5,y)
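
The naming step from sense tags to predicates in this example can be sketched as below. Only the predicate names are derived; the argument variables (e0, x, y, ...) and the untagged `in'` literal are written out by hand to match the slide, since choosing them is the substantive part of the ISI work. Adverbs are left out because the allegro example maps quickly#r#4 to quick-D-4', which is not a plain renaming. All function names are invented for illustration.

```python
POS_CODES = {"n": "N", "v": "V", "a": "A"}  # adverbs deliberately not covered

def predicate(sense_tag):
    """Turn 'move#v#2' into the predicate name move-V-2'."""
    lemma, pos, number = sense_tag.split("#")
    return f"{lemma}-{POS_CODES[pos]}-{number}'"

def literal(sense_tag, *args):
    """Predicate applied to hand-chosen argument variables."""
    return f"{predicate(sense_tag)}({','.join(args)})"

def dance_axiom():
    """Rebuild the slide's formula for dance, v, 2 from its sense tags."""
    head = literal("dance#v#2", "e0", "x")
    body = [
        literal("move#v#2", "e1", "x"),
        "in'(e2,e1,y)",  # preposition, carried over without a sense tag
        literal("graceful#a#1", "e3", "y"),
        literal("rhythmic#a#1", "e4", "y"),
        literal("way#n#8", "e5", "y"),
    ]
    return head + " -> " + " & ".join(body)
```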

ISI Effort examples, cont.

In:

a musical composition or passage performed quickly

  a                    ignore
  musical composition  musical_composition#n#1
  or                   ignore
  musical passage      musical_passage#n#1
  performed            perform#v#2
  quickly              quickly#r#4

Out:

gloss for allegro, n, 2:

allegro-N-2'(e0,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x) & perform-V-2'(e2,y,x) & quick-D-4'(e3,e2)

musical_composition-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

musical_passage-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

ISI Method

• Identify the most common gloss patterns and convert them first

• Parse
  – using Charniak’s parser:
    • uneven, sometimes bizarre results (“aspen”: VBN)
  – Hermjakob’s CONTEX parser:
    • greater local control

ISI Progress

• Completed glosses of nouns with patterns:
  – NG (P NG)*: 45% of nouns
  – + NG ((VBN | VING) NG): 15% of nouns

• 45 + 15 = 60% complete!

• But gloss patterns are in a Zipf distribution:

Distribution of noun glosses

Pattern             Count   Share
NP (NP,PP)           7181     41%
NP (NP,SBAR)         2978     17%
NP (NP,VP)           2684     15%
NP (NP,PP,SBAR)       363      2%
NP (NP,CC,NP)         280      2%
NP (DT,JJ,NN)         272      2%
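
The practical consequence of the long tail can be seen by accumulating the shares from the distribution above. This is only arithmetic on the six rows shown; the names are mine, and the patterns beyond the sixth are not listed in the source.

```python
# (pattern, count, share in percent), copied from the distribution above.
ROWS = [
    ("NP (NP,PP)", 7181, 41),
    ("NP (NP,SBAR)", 2978, 17),
    ("NP (NP,VP)", 2684, 15),
    ("NP (NP,PP,SBAR)", 363, 2),
    ("NP (NP,CC,NP)", 280, 2),
    ("NP (DT,JJ,NN)", 272, 2),
]

def cumulative_share(rows):
    """Running total of the percentage column, most frequent pattern first."""
    running = 0
    result = []
    for pattern, _count, percent in rows:
        running += percent
        result.append((pattern, running))
    return result
```

The first three patterns reach 73% between them, while the next three add only two points each, so every further pattern converted covers fewer glosses than the one before.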