Natural Language Processing - Hasso Plattner Institute

Natural Language Processing: Lexical Semantics
Word Sense Disambiguation and Word Similarity

Potsdam, 31 May 2012
Saeedeh Momtazi, Information Systems Group

based on the slides of the course book


Outline

1 Lexical Semantics (WordNet)

2 Word Sense Disambiguation

3 Word Similarity

Saeedeh Momtazi | NLP | 31.05.2012


Word Meaning

Considering the meaning(s) of a word in addition to its written form

Word Sense: a discrete representation of an aspect of the meaning of a word


Word

Lexeme: an entry in a lexicon consisting of a pair: a form with a single meaning representation

Camel (animal)
Camel (music band)

Lemma: the grammatical form that is used to represent a lexeme

Camel


Homonymy

Words that have the same form but different meanings

Homographs: Camel (animal) / Camel (music band)

Homophones: write / right


Semantic Relations

Realizing lexical relations among words

Hyponymy (is-a) {parent: hypernym, child: hyponym}: dog & animal

Meronymy (part-of): arm & body

Synonymy: fall & autumn

Antonymy: tall & short

Note: relations hold between senses rather than between words


WordNet

A hierarchical database of lexical relations

Three separate sub-databases: Nouns; Verbs; Adjectives and Adverbs

Closed-class words are not included

Each word is annotated with a set of senses

Available online: http://wordnetweb.princeton.edu/perl/webwn


WordNet

Number of words in WordNet 3.0

Category     Entries
Noun         117,097
Verb          11,488
Adjective     22,141
Adverb         4,061

Average number of senses in WordNet 3.0

Category     Avg. senses
Noun         1.23
Verb         2.16


Word Sense

Synset (synonym set)


Word Relations (Hypernym)


Word Relations (Sister)


Applications

Information retrieval
Machine translation
Speech synthesis


Information retrieval


Machine translation


Example

Sense: band 532736 Music N
The band made copious recordings now regarded as classic from 1941 to 1950. These were to have a tremendous influence on the worldwide jazz revival to come. During the war Lu led a 20 piece navy band in Hawaii.


Example

Sense: band 532838 Rubber-band N
He had assumed that so famous and distinguished a professor would have been given the best possible medical attention; it was the sort of assumption young men make. Here suspended from Lewis's person were pieces of tubing held on by rubber bands, an old wooden peg, a bit of cork.


Example

Sense: band 532734 Range N

There would be equal access to all currencies, financial instruments and financial services - and no major constitutional change. As realignments become more rare and exchange rates waver in narrower bands, the system could evolve into one of fixed exchange rates.


Word Sense Disambiguation

Input:
A word
The context of the word
Set of potential senses for the word

Output:
The best sense of the word for this context


Approaches

Thesaurus-based

Supervised learning

Semi-supervised learning


Thesaurus-based

Extracting sense definitions from existing sources: dictionaries, thesauri, Wikipedia


Thesaurus-based


The Lesk Algorithm

Selecting the sense whose definition shares the most words with the word's context

Simplified Algorithm [Kilgarriff and Rosenzweig, 2000]
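The simplified algorithm described above can be sketched in a few lines. This is a minimal illustration with hypothetical glosses and a toy stopword list, not the exact evaluation setup of the cited paper:

```python
def simplified_lesk(word, context, sense_glosses,
                    stopwords=frozenset({"a", "an", "the", "of", "in", "is", "for", "to"})):
    """Pick the sense whose gloss shares the most words with the context.

    sense_glosses maps a sense label to its gloss text; a real system
    would pull glosses from WordNet or another dictionary.
    """
    context_words = {w.lower() for w in context.split()} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = {w.lower() for w in gloss.split()} - stopwords
        overlap = len(gloss_words & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for two senses of "band".
glosses = {
    "band_music": "a group of musicians playing popular music for dancing",
    "band_rubber": "a narrow loop of elastic rubber used to hold objects together",
}
print(simplified_lesk("band", "the band played music at the dance", glosses))
```

Only "music" overlaps with the music-sense gloss here, which is already enough to prefer that sense over the rubber sense (zero overlap).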


The Lesk Algorithm

Simple to implement
No training data needed
Relatively poor results


Supervised Learning

Training data: a corpus in which each occurrence of the ambiguous word w is annotated with its correct sense

SemCor: 234,000 sense-tagged words from the Brown corpus
SENSEVAL-1: 34 target words
SENSEVAL-2: 73 target words
SENSEVAL-3: 57 target words (2,081 sense-tagged instances)


Feature Selection

Using the words in the context with a specific window size

Collocation
Considering all words in a window (as well as their POS tags) and their position

Bag-of-words
Considering the frequent words regardless of their position:
Deriving the set of k most frequent words in the window from the training corpus
Representing each word in the data as a k-dimensional vector
Finding the frequency of the selected words in the context of the current observation


Collocation

Sense: band 532734 Range N

There would be equal access to all currencies, financial instruments and financial services - and no major constitutional change. As realignments become more rare and exchange rates waver in narrower bands, the system could evolve into one of fixed exchange rates.

Window size: +/- 3
Context: waver in narrower bands the system could
{Wn−3, Pn−3, Wn−2, Pn−2, Wn−1, Pn−1, Wn+1, Pn+1, Wn+2, Pn+2, Wn+3, Pn+3}
{waver, NN, in, IN, narrower, JJ, the, DT, system, NN, could, MD}
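Extracting this ordered (word, POS) feature vector is mechanical; a minimal sketch, using the POS tags from the slide's example:

```python
def collocation_features(tagged_tokens, target_index, window=3):
    """Build the ordered collocation feature vector around a target word.

    tagged_tokens: list of (word, pos) pairs; positions outside the
    sentence are padded with None.
    """
    features = []
    for offset in list(range(-window, 0)) + list(range(1, window + 1)):
        i = target_index + offset
        if 0 <= i < len(tagged_tokens):
            word, pos = tagged_tokens[i]
        else:
            word, pos = None, None
        features.extend([word, pos])
    return features

# Tokens and POS tags taken from the slide's example context.
sent = [("waver", "NN"), ("in", "IN"), ("narrower", "JJ"), ("bands", "NNS"),
        ("the", "DT"), ("system", "NN"), ("could", "MD")]
print(collocation_features(sent, target_index=3))
```

The result is exactly the {Wn−3, Pn−3, ..., Wn+3, Pn+3} vector shown above.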


Bag-of-word

Sense: band 532734 Range N

There would be equal access to all currencies, financial instruments and financial services - and no major constitutional change. As realignments become more rare and exchange rates waver in narrower bands, the system could evolve into one of fixed exchange rates.

Window size: +/- 3
Context: waver in narrower bands the system could
k frequent words for band:
{circle, dance, group, jewelry, music, narrow, ring, rubber, wave}
{   0   ,   0  ,   0  ,    0   ,   0  ,   1   ,  0  ,   0   ,  1  }
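The same mapping can be sketched as code. This assumes the context tokens have already been normalized to match the vocabulary entries (narrower -> narrow, waver -> wave), which is a simplification:

```python
def bow_features(context_tokens, vocab):
    """Bag-of-words vector over the k most frequent cue words.

    For the slide's tiny window each word occurs at most once, so the
    vector is effectively binary; over larger windows it holds counts.
    """
    counts = {}
    for tok in context_tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return [counts.get(w, 0) for w in vocab]

vocab = ["circle", "dance", "group", "jewelry", "music", "narrow", "ring", "rubber", "wave"]
# Context words stemmed to match the vocabulary (hypothetical preprocessing).
context = ["wave", "in", "narrow", "the", "system", "could"]
print(bow_features(context, vocab))
```

This reproduces the 9-dimensional vector shown above.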


Naïve Bayes Classification

Choosing the best sense s out of all possible senses si for a feature vector f of the word w

s = argmax_si P(si | f)

s = argmax_si [ P(f | si) · P(si) ] / P(f)

P(f) is the same for all senses, so it has no effect on the argmax:

s = argmax_si P(f | si) · P(si)


Naïve Bayes Classification

s = argmax_si P(si) · P(f | si)
              (prior)  (likelihood)

s = argmax_si P(si) · ∏_{j=1..m} P(fj | si)

P(si) = #(si) / #(w)

#(si): number of times the sense si is used for the word w in the training data
#(w): the total number of samples for the word w


Naïve Bayes Classification

s = argmax_si P(si) · P(f | si)
              (prior)  (likelihood)

s = argmax_si P(si) · ∏_{j=1..m} P(fj | si)

P(fj | si) = #(fj, si) / #(si)

#(fj, si): the number of times the feature fj occurred for the sense si of word w
#(si): the total number of samples of w with the sense si in the training data
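The prior and likelihood estimates above combine into a small classifier. A minimal sketch with hypothetical toy features, and no smoothing (matching the formulas; a real system would add e.g. Laplace smoothing):

```python
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes word sense classifier using the slide's MLE estimates:
    P(si) = #(si)/#(w) and P(fj|si) = #(fj, si)/#(si)."""

    def fit(self, samples):
        # samples: list of (feature_list, sense) pairs for one target word
        self.sense_counts = Counter(sense for _, sense in samples)
        self.total = len(samples)
        self.feat_counts = defaultdict(Counter)
        for feats, sense in samples:
            self.feat_counts[sense].update(feats)

    def predict(self, feats):
        def score(sense):
            p = self.sense_counts[sense] / self.total          # prior
            for f in feats:                                    # likelihoods
                p *= self.feat_counts[sense][f] / self.sense_counts[sense]
            return p
        return max(self.sense_counts, key=score)

# Hypothetical training samples for the target word "band".
train = [(["music", "play"], "band_music"),
         (["music", "dance"], "band_music"),
         (["rubber", "elastic"], "band_rubber")]
clf = NaiveBayesWSD()
clf.fit(train)
print(clf.predict(["music"]))
```

With these counts, "music" selects band_music (prior 2/3, likelihood 1) over band_rubber (likelihood 0).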


Semi-supervised Learning

What is the best approach when we do not have enough data to train a model?


Semi-supervised Learning

A small amount of labeled data

A large amount of unlabeled data

Solution:
Finding the similarity between the labeled and unlabeled data
Predicting the labels of the unlabeled data


Semi-supervised Learning

What is the best approach when we do not have enough data to train a model?

For each sense:
Select the most important word that frequently co-occurs with the target word only for this particular sense
Find the sentences from the unlabeled data that contain the target word and the selected word
Label these sentences with the corresponding sense
Add the newly labeled sentences to the training data

Example for band:

Sense    Selected word
Music    play
Rubber   elastic
Range    spectrum
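One bootstrapping pass over these steps can be sketched as follows, using the seed words from the table above (the seeds and sentences are illustrative; a full system, such as Yarowsky-style bootstrapping, would iterate and re-learn new cue words each round):

```python
def bootstrap_labels(seed_words, unlabeled_sentences, target="band"):
    """Label unlabeled sentences that contain the target word together
    with exactly one sense's seed word, and return them as new training
    data. Sentences matching zero or several seeds are skipped.
    """
    new_training = []
    for sent in unlabeled_sentences:
        tokens = set(sent.lower().split())
        if target not in tokens:
            continue
        matches = [sense for sense, seed in seed_words.items() if seed in tokens]
        if len(matches) == 1:          # skip ambiguous or unmatched sentences
            new_training.append((sent, matches[0]))
    return new_training

seeds = {"Music": "play", "Rubber": "elastic", "Range": "spectrum"}
sents = ["the band will play tonight",
         "an elastic band snapped",
         "no target word here"]
print(bootstrap_labels(seeds, sents))
```

The first two sentences get labeled Music and Rubber respectively; the third is ignored because it does not contain the target word.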


Word Similarity

Task

Finding the similarity between two words
Covering a somewhat wider range of relations in meaning (different from synonymy)
Being defined by a score (degree of similarity)

Example:
bank (financial institution) & fund
car & bicycle


Applications

Information retrieval
Question answering
Document categorization
Machine translation
Language modeling
Word clustering


Information retrieval &Question Answering


Approaches

Thesaurus-based
Based on their distance in the thesaurus
Based on their definitions in the thesaurus (glosses)

Distributional
Based on the similarity between their contexts


Thesaurus-based Methods

Two concepts (senses) are similar if they are "nearby" (if there is a short path between them in the hypernym hierarchy)


Path-based Similarity

pathlen(c1, c2) = 1 + number of edges in the shortest path between the sense nodes c1 and c2

sim_path(c1, c2) = −log pathlen(c1, c2)

wordsim(w1, w2) = max_{c1 ∈ senses(w1), c2 ∈ senses(w2)} sim(c1, c2)

(used when we have no knowledge about the exact sense, which is the case when processing general text)
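A minimal sketch of path-based similarity over a toy hypernym fragment (the hierarchy below is hypothetical; a real system would walk the WordNet graph, e.g. via NLTK):

```python
import math
from collections import deque

# Hypothetical fragment of a hypernym hierarchy (child -> parent edges).
PARENT = {"nickel": "coin", "dime": "coin", "coin": "currency",
          "currency": "medium_of_exchange", "budget": "fund",
          "fund": "money", "money": "medium_of_exchange"}

def neighbors(node):
    """Treat hypernym edges as undirected for shortest-path search."""
    nbrs = set()
    if node in PARENT:
        nbrs.add(PARENT[node])
    nbrs.update(child for child, parent in PARENT.items() if parent == node)
    return nbrs

def pathlen(c1, c2):
    """1 + number of edges on the shortest path between two sense nodes."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist + 1
        for nbr in neighbors(node):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("inf")

def sim_path(c1, c2):
    return -math.log(pathlen(c1, c2))

print(pathlen("nickel", "dime"))  # nickel -> coin -> dime: 2 edges, pathlen 3
print(sim_path("nickel", "dime") > sim_path("nickel", "budget"))
```

nickel and dime (pathlen 3) come out more similar than nickel and budget, which are connected only through the abstract medium_of_exchange node.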


Path-based Similarity

Shortcoming

Assumes that each link represents a uniform distance:
nickel to money seems closer than nickel to standard

Solution:
Using a metric that represents the cost of each edge independently
⇒ Words connected only through abstract nodes are less similar


Information Content Similarity

Assigning a probability P(c) to each node of the thesaurus
P(c) is the probability that a randomly selected word in a corpus is an instance of concept c
⇒ P(root) = 1, since all words are subsumed by the root concept
The probabilities are estimated by counting the words in a corpus
The lower a concept is in the hierarchy, the lower its probability

P(c) = ( Σ_{w ∈ words(c)} #w ) / N

words(c): the set of words subsumed by concept c
N: the total number of words in the corpus that also appear in the thesaurus


Information Content Similarity

words(coin) = {nickel, dime}
words(coinage) = {nickel, dime, coin}
words(money) = {budget, fund}
words(medium of exchange) = {nickel, dime, coin, coinage, currency, budget, fund, money}


Information Content Similarity

Augmenting each concept in the WordNet hierarchy with a probability P(c)


Information Content Similarity

Information Content:
IC(c) = −log P(c)

Lowest common subsumer:
LCS(c1, c2) = the lowest node in the hierarchy that subsumes both c1 and c2


Information Content Similarity

Resnik similarity

Measuring the common amount of information by the information content of the lowest common subsumer of the two concepts

sim_resnik(c1, c2) = −log P(LCS(c1, c2))

sim_resnik(hill, coast) = −log P(geological-formation)


Information Content Similarity

Lin similarity

Measuring the difference between two concepts in addition to their commonality

sim_Lin(c1, c2) = 2 log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )

sim_Lin(hill, coast) = 2 log P(geological-formation) / ( log P(hill) + log P(coast) )


Information Content Similarity

Jiang-Conrath similarity

sim_JC(c1, c2) = 1 / ( IC(c1) + IC(c2) − 2 IC(LCS(c1, c2)) )
              = 1 / ( 2 log P(LCS(c1, c2)) − log P(c1) − log P(c2) )

sim_JC(hill, coast) = 1 / ( 2 log P(geological-formation) − log P(hill) − log P(coast) )
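The three information-content measures can be computed side by side. A minimal sketch with hypothetical concept probabilities (loosely following the coin/money fragment; in practice P(c) comes from corpus counts):

```python
import math

# Hypothetical P(c) values for a small fragment of a hierarchy.
P = {"medium_of_exchange": 1.0, "money": 0.5, "coinage": 0.4,
     "coin": 0.2, "fund": 0.1, "nickel": 0.05, "dime": 0.05}

def sim_resnik(p_lcs):
    # IC of the lowest common subsumer
    return -math.log(p_lcs)

def sim_lin(p1, p2, p_lcs):
    return 2 * math.log(p_lcs) / (math.log(p1) + math.log(p2))

def sim_jc(p1, p2, p_lcs):
    # 1 / (IC(c1) + IC(c2) - 2*IC(LCS)), with IC(c) = -log P(c)
    return 1.0 / (-math.log(p1) - math.log(p2) + 2 * math.log(p_lcs))

# nickel vs dime: LCS is coin; nickel vs fund: LCS is medium_of_exchange
print(round(sim_resnik(P["coin"]), 3))
print(sim_lin(P["nickel"], P["dime"], P["coin"]) >
      sim_lin(P["nickel"], P["fund"], P["medium_of_exchange"]))
```

Because the LCS of nickel and fund is the root (P = 1, IC = 0), all three measures rate nickel/dime as more similar than nickel/fund, as intended.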


Extended Lesk

Looking at word definitions in the thesaurus (glosses)
Measuring the similarity based on the number of common words in their definitions
Adding a score of n² for each n-word phrase that occurs in both glosses
Computing overlap for other relations as well (glosses of hypernyms and hyponyms)

sim_eLesk(c1, c2) = Σ_{r,q ∈ RELS} overlap( gloss(r(c1)), gloss(q(c2)) )


Extended Lesk

Drawing paper
paper that is specially prepared for use in drafting

Decal
the art of transferring designs from specially prepared paper to a wood or glass or metal surface

common phrases: "specially prepared" and "paper"

sim_eLesk = 1² + 2² = 1 + 4 = 5
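The n² phrase-overlap score can be sketched with a simple greedy search for the longest shared token runs (an approximation; the original algorithm handles ties and function words more carefully):

```python
def gloss_overlap(gloss1, gloss2):
    """Sum n^2 over maximal n-word phrases shared by both glosses.

    Greedy sketch: repeatedly find the longest common token run,
    score it, remove it from both glosses, and continue.
    """
    t1, t2 = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        best = None  # (length, start_in_t1, start_in_t2)
        for i in range(len(t1)):
            for j in range(len(t2)):
                n = 0
                while i + n < len(t1) and j + n < len(t2) and t1[i + n] == t2[j + n]:
                    n += 1
                if n > 0 and (best is None or n > best[0]):
                    best = (n, i, j)
        if best is None:
            return score
        n, i, j = best
        score += n * n
        del t1[i:i + n]
        del t2[j:j + n]

g1 = "paper that is specially prepared for use in drafting"
g2 = ("the art of transferring designs from specially prepared paper "
     "to a wood or glass or metal surface")
print(gloss_overlap(g1, g2))  # "specially prepared" (2^2) + "paper" (1^2) = 5
```

On the two glosses above this reproduces the slide's score of 5.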


Thesaurus-based Similarities

Overview


Available Libraries

WordNet::Similarity

Source:http://wn-similarity.sourceforge.net/

Web-based interface:http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi


Thesaurus-based Methods

Shortcomings

Many words are missing from the thesaurus
Only hyponym (is-a) information is used:
might be useful for nouns, but weak for adjectives, adverbs, and verbs
Many languages have no thesaurus

Alternative:
Using distributional methods for word similarity


Distributional Methods

Using context information to find the similarity between words
Guessing the meaning of a word based on its context

tezgüino?

A bottle of tezgüino is on the table
Everybody likes tezgüino
Tezgüino makes you drunk
We make tezgüino out of corn

⇒ An alcoholic beverage

Context Representations

Considering a target term t
Building a vocabulary of M words ({w1, w2, w3, ..., wM})
Creating a vector for t with M features (t = {f1, f2, f3, ..., fM})

fi is the number of times the word wi occurs in the context of t

tezgüino?

A bottle of tezgüino is on the table
Everybody likes tezgüino
Tezgüino makes you drunk
We make tezgüino out of corn

t = tezgüino
vocab = {book, bottle, city, drunk, like, water, ...}

t = {  0  ,   1   ,  0  ,   1   ,  1  ,   0   , ... }

Context Representations

Term-term matrix
The number of times the context word c appears close to the term t within a window

             art  boil  data  function  large  sugar  summarize  water
apricot       0    1     0       0        1      2       0         1
pineapple     0    1     0       0        1      1       0         1
digital       0    0     1       3        1      0       1         0
information   0    0     9       1        1      0       2         0

Goal
Finding a metric that, based on the vectors of these four words, shows:
apricot and pineapple to be highly similar
digital and information to be highly similar
the other four pairings (apricot & digital, apricot & information, pineapple & digital, pineapple & information) to be less similar


Distributional similarity

Three parameters should be specified:

How are the co-occurrence terms defined? (what counts as a neighbor?)

How are terms weighted?

Which vector distance metric should be used?


Distributional similarity

How are the co-occurrence terms defined? (what counts as a neighbor?)

Window of k words
Sentence
Paragraph
Document


Distributional similarity

How are terms weighted?

Binary
1 if two words co-occur (no matter how often), 0 otherwise

Frequency
Number of times two words co-occur, relative to the total size of the corpus
P(t, c) = #(t, c) / N

Pointwise mutual information
Number of times two words co-occur, compared with what we would expect if they were independent
PMI(t, c) = log [ P(t, c) / (P(t) · P(c)) ]


Distributional similarity

#(t, c)
             art  boil  data  function  large  sugar  summarize  water
apricot       0    1     0       0        1      2       0         1
pineapple     0    1     0       0        1      1       0         1
digital       0    0     1       3        1      0       1         0
information   0    0     9       1        1      0       2         0

P(t, c)   {N = 28}
             art   boil   data   function  large  sugar  summarize  water
apricot       0    0.035   0       0       0.035  0.071     0       0.035
pineapple     0    0.035   0       0       0.035  0.035     0       0.035
digital       0     0     0.035  0.107     0.035    0     0.035      0
information   0     0     0.321  0.035     0.035    0     0.071      0


Pointwise Mutual Information

             art   boil   data   function  large  sugar  summarize  water
apricot       0    0.035   0       0       0.035  0.071     0       0.035
pineapple     0    0.035   0       0       0.035  0.035     0       0.035
digital       0     0     0.035  0.107     0.035    0     0.035      0
information   0     0     0.321  0.035     0.035    0     0.071      0

P(digital, summarize) = 0.035
P(information, function) = 0.035

P(digital, summarize) = P(information, function)

PMI(digital, summarize) = ?
PMI(information, function) = ?


Pointwise Mutual Information

             art   boil   data   function  large  sugar  summarize  water
apricot       0    0.035   0       0       0.035  0.071     0       0.035
pineapple     0    0.035   0       0       0.035  0.035     0       0.035
digital       0     0     0.035  0.107     0.035    0     0.035      0
information   0     0     0.321  0.035     0.035    0     0.071      0

P(digital, summarize) = 0.035
P(information, function) = 0.035

P(digital) = 0.212        P(summarize) = 0.106
P(information) = 0.462    P(function) = 0.142

PMI(digital, summarize) = log [ 0.035 / (0.212 × 0.106) ] = log 1.557
PMI(information, function) = log [ 0.035 / (0.462 × 0.142) ] = log 0.533

⇒ PMI(digital, summarize) > PMI(information, function)
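Computing PMI directly from the count matrix makes the contrast concrete; a minimal sketch over the table above:

```python
import math

WORDS = ["art", "boil", "data", "function", "large", "sugar", "summarize", "water"]
COUNTS = {
    "apricot":     [0, 1, 0, 0, 1, 2, 0, 1],
    "pineapple":   [0, 1, 0, 0, 1, 1, 0, 1],
    "digital":     [0, 0, 1, 3, 1, 0, 1, 0],
    "information": [0, 0, 9, 1, 1, 0, 2, 0],
}
N = sum(sum(row) for row in COUNTS.values())  # total co-occurrence count

def pmi(term, context):
    """PMI(t, c) = log [ P(t, c) / (P(t) * P(c)) ], with marginals
    estimated from the row and column sums of the matrix."""
    j = WORDS.index(context)
    p_tc = COUNTS[term][j] / N
    p_t = sum(COUNTS[term]) / N
    p_c = sum(row[j] for row in COUNTS.values()) / N
    return math.log(p_tc / (p_t * p_c))

print(pmi("digital", "summarize") > pmi("information", "function"))
```

Although the joint probabilities are equal (both 1/28), digital/summarize has a positive PMI while information/function has a negative one, because information and function are individually frequent.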


Distributional similarity

How are terms weighted?

Binary

Frequency

Pointwise mutual information
PMI(t, c) = log [ P(t, c) / (P(t) · P(c)) ]

t-test
t-test(t, c) = [ P(t, c) − P(t) · P(c) ] / sqrt( P(t) · P(c) )


Distributional similarity

What vector distance metric should be used?

Cosine

sim_cosine(v, w) = Σi (vi × wi) / ( sqrt(Σi vi²) × sqrt(Σi wi²) )

Jaccard

sim_jaccard(v, w) = Σi min(vi, wi) / Σi max(vi, wi)

Dice

sim_dice(v, w) = 2 × Σi min(vi, wi) / Σi (vi + wi)
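The three metrics are a few lines each; a minimal sketch applied to the rows of the earlier term-term matrix:

```python
import math

def sim_cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

def sim_jaccard(v, w):
    return sum(min(a, b) for a, b in zip(v, w)) / sum(max(a, b) for a, b in zip(v, w))

def sim_dice(v, w):
    return 2 * sum(min(a, b) for a, b in zip(v, w)) / sum(a + b for a, b in zip(v, w))

# Rows of the term-term matrix from the earlier slide.
apricot   = [0, 1, 0, 0, 1, 2, 0, 1]
pineapple = [0, 1, 0, 0, 1, 1, 0, 1]
digital   = [0, 0, 1, 3, 1, 0, 1, 0]

print(sim_cosine(apricot, pineapple) > sim_cosine(apricot, digital))
```

All three metrics rate apricot/pineapple far above apricot/digital, which is exactly the behavior the earlier Goal slide asks for.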


Further Reading

Speech and Language Processing, Chapters 19 and 20
