
Sentence Similarity Based on Semantic Nets and Corpus Statistics


Page 1: Sentence Similarity Based on Semantic Nets and Corpus Statistics


SENTENCE SIMILARITY BASED ON SEMANTIC NETS AND CORPUS STATISTICS

Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crockett

The School of Computing and Intelligent Systems, the University of Ulster

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 8, AUGUST 2006

Reporter: Yi-Ting Huang
Date: 2009/12/02

Page 2: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Research background

Recent applications of natural language processing present a need for an effective method to compute the similarity between very short texts or sentences, e.g.:

- conversational agents / dialogue systems
- text summarization
- machine translation
- web page retrieval
- text mining

Page 3: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Three major drawbacks of traditional methods

A sentence is represented in a very high-dimensional space, with hundreds or thousands of dimensions, which leads to sparseness.

Some methods require the user’s intensive involvement to manually preprocess sentence information.

Once the similarity method is designed for an application domain, it cannot be adapted easily to other domains.

Page 4: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Research question

Traditionally, techniques for detecting similarity between long texts (documents) have centered on analyzing shared words.

However, in short texts, word co-occurrence may be rare or even null.

The focus of this paper is on computing the similarity between very short texts.

An effective method is expected to be dynamic, focusing only on the sentences of concern; fully automatic, requiring no manual work from users; and readily adaptable across the range of potential application domains.

Page 5: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Outline

- Introduction
- Related works
- Methods
- Implementation
- Experiment
- Conclusion

Page 6: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Outline

- Introduction
- Related works
  - word co-occurrence methods
  - corpus-based methods
  - descriptive features-based methods
- Methods
- Implementation
- Experiment
- Conclusion

Page 7: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word co-occurrence methods (1/3)

Word co-occurrence methods are often known as “bag of words” methods. They are commonly used in Information Retrieval (IR) systems.

This technique relies on the assumption that more similar documents share more of the same words.

Drawbacks:

- The sentence representation is not very efficient.
- The word set in IR systems usually excludes function words such as “the, of, an,” etc.
- Sentences with similar meaning do not necessarily share many words.

Page 8: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word co-occurrence methods (2/3)

One extension of word co-occurrence methods is the use of a lexical dictionary to compute the similarity of a pair of words taken from the two sentences that are being compared.

Sentence similarity is simply obtained by aggregating similarity values of all word pairs.

Page 9: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word co-occurrence methods (3/3)

Pattern matching differs from pure word co-occurrence methods by incorporating local structural information about words in the predefined sentence patterns.

This technique requires a complete pattern set for each meaning in order to avoid ambiguity and mismatches.

Drawbacks:

- Manual compilation is an immensely arduous and tedious task.
- Once the pattern sets are defined, the algorithm is unable to cope with unplanned novel utterances from human users.

Page 10: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Corpus-based methods (1/7)

One recent active field of research that contributes to sentence similarity computation comprises methods based on statistical information about words in a huge corpus:

- Latent Semantic Analysis (LSA)
- Hyperspace Analogues to Language (HAL)

Page 11: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Corpus-based methods (2/7)

Latent Semantic Analysis (LSA): a word-by-context matrix built from a corpus is reduced in dimensionality (via singular value decomposition); the original word-by-context matrix is then reconstructed from the reduced dimensional space.

When LSA is used to compute sentence similarity, a vector for each sentence is formed in the reduced dimension space; similarity is then measured by computing the similarity of these two vectors.
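The two LSA slides above describe the procedure only in words. Below is a minimal Python/numpy sketch of those steps; the toy word-by-context matrix and the sum-of-word-vectors sentence representation are illustrative assumptions, not the paper's implementation.

    # LSA sketch: reduce a word-by-context count matrix with truncated SVD,
    # then compare sentences as vectors in the reduced space.
    import numpy as np

    # Rows are words, columns are contexts (documents/passages); toy counts.
    X = np.array([
        [2, 0, 1, 0],
        [0, 1, 0, 2],
        [1, 1, 0, 0],
        [0, 0, 2, 1],
    ], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                                    # number of retained dimensions
    word_vectors = U[:, :k] * s[:k]          # word vectors in the reduced space

    def sentence_vector(word_rows):
        # One common choice: a sentence vector is the sum of its word vectors.
        return word_vectors[word_rows].sum(axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of two toy "sentences", each a list of word row indices.
    print(cosine(sentence_vector([0, 2]), sentence_vector([1, 3])))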

Page 12: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Page 13: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Page 14: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Corpus-based methods (5/7)

Drawbacks of LSA:

- Because of the computational limit of SVD, the dimension size of the word-by-context matrix is limited to several hundred.
- The dimensionality is fixed, so the vector is fixed and thus likely to be a very sparse representation of a short text such as a sentence.
- LSA ignores any syntactic information from the two sentences being compared.

Page 15: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Corpus-based methods (6/7)

Hyperspace Analogues to Language (HAL) is closely related to LSA; both capture the meaning of a word or text using lexical co-occurrence information.

An N×N matrix is formed for a given vocabulary of N words. Each entry of the matrix records the (weighted) word co-occurrences within the window moving through the entire corpus.
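As a sketch of the matrix construction just described; the toy corpus, the window size of 3, and the "window size minus distance plus one" weighting are assumptions for illustration.

    # HAL sketch: slide a window through the corpus and accumulate weighted
    # co-occurrence counts, weighting closer neighbours more strongly.
    corpus = "the quick brown fox jumps over the lazy dog".split()
    vocab = sorted(set(corpus))
    index = {w: i for i, w in enumerate(vocab)}

    N, window = len(vocab), 3
    M = [[0.0] * N for _ in range(N)]        # N x N co-occurrence matrix

    for i, w in enumerate(corpus):
        for d in range(1, window + 1):
            if i + d < len(corpus):
                M[index[w]][index[corpus[i + d]]] += window - d + 1

    # A word's context vector is its row plus its column of the matrix.
    row = M[index["fox"]]
    col = [M[j][index["fox"]] for j in range(N)]
    print(row + col)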

Page 16: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Corpus-based methods (7/7)

HAL’s weaker performance may be due to the way the memory matrix is built and to its approach to forming sentence vectors.

The authors’ experimental results showed that HAL was not as promising as LSA in the computation of similarity for short texts.

Page 17: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Features-based methods

The feature vector method tries to represent a sentence using a set of predefined features, e.g. nouns may have features such as HUMAN (with value of human or nonhuman), SOFTNESS (soft or hard), etc.

The difficulties for this method lie in the definition of effective features and in automatically obtaining values for features from a sentence.

It is still problematic to define features for abstract concepts.

Page 18: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Outline

- Introduction
- Related works
- Methods
  - semantic similarity between words
  - semantic similarity between sentences
  - word order similarity between sentences
- Implementation
- Experiment
- Conclusion

Page 19: Sentence Similarity Based on Semantic Nets and Corpus Statistics


The proposed methods

Page 20: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Path length

Path length is obtained by counting synset links along the path between the two words.

If a word is polysemous (i.e., a word having many meanings), multiple paths may exist between the two words. Only the shortest path is then used in calculating semantic similarity between words.
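A sketch of this computation using NLTK's WordNet interface; note, as an assumption, that the paper implements its own search over ISA and part-whole links, while NLTK's shortest_path_distance follows hypernym/hyponym paths only.

    # Shortest path length between two words, minimized over all sense pairs.
    from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

    def min_path_length(word1, word2):
        """Return the shortest path length (in synset links) over all sense
        pairs, or None if either word is missing from WordNet."""
        synsets1, synsets2 = wn.synsets(word1), wn.synsets(word2)
        lengths = [s1.shortest_path_distance(s2)
                   for s1 in synsets1 for s2 in synsets2]
        lengths = [l for l in lengths if l is not None]
        return min(lengths) if lengths else None

    print(min_path_length("boy", "girl"))     # short path: similar words
    print(min_path_length("boy", "teacher"))  # longer path: less similar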

Page 21: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Semantic similarity between words

sim(boy, girl) = ?  path length 4
sim(boy, teacher) = ?  path length 6
→ similarity(boy, girl) > similarity(boy, teacher)

However, sim(boy, animal) = ?  path length 4
→ similarity(boy, girl) = similarity(boy, animal)?

Here the semantic similarity s(w1, w2) is taken as a function of the minimum length of the path connecting the two words; path length alone cannot distinguish boy-girl from boy-animal.

Page 22: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Semantic similarity between words

It is apparent that words at upper layers of the hierarchy have more general semantics and less similarity between them, while words at lower layers have more concrete semantics and more similarity. The depth of a word in the hierarchy should therefore be taken into account.

Page 23: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Path depth

The depth of the subsumer is derived by counting the levels from the subsumer to the top of the lexical hierarchy.
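A companion sketch for the subsumer depth, again with NLTK; the helper and the "root has depth 1" convention are illustrative assumptions.

    # Depth of the deepest common subsumer (lowest common hypernym).
    from nltk.corpus import wordnet as wn

    def subsumer_depth(synset1, synset2):
        subsumers = synset1.lowest_common_hypernyms(synset2)
        if not subsumers:
            return 0
        # max_depth() counts hypernym links from the synset up to the root,
        # so add 1 to count levels with the root at depth 1.
        return max(s.max_depth() for s in subsumers) + 1

    boy, girl = wn.synset('boy.n.01'), wn.synset('girl.n.01')
    print(subsumer_depth(boy, girl))   # subsumer is roughly person/human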

Page 24: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Properties of transfer function

The direct use of these information sources (path length and depth) as a similarity metric is inappropriate because they are unbounded.

If we assign identical words a similarity of 1 and completely dissimilar words a similarity of 0, then the interval of similarity is [0, 1].

As the path length decreases to zero, the similarity should monotonically increase toward the limit 1; as the path length increases toward infinity (although this would not happen in an organized lexical database), the similarity should monotonically decrease toward 0.

Page 25: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Contribution of path length

The path length between two words, w1 and w2, can be determined from one of three cases:

- Case 1: w1 and w2 have the same meaning; we assign the semantic path length between w1 and w2 a value of 0.
- Case 2: w1 and w2 partially share the same features; we assign the semantic path length between w1 and w2 a value of 1.
- Case 3: otherwise, we count the actual path length between w1 and w2.

Page 26: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Scaling Depth Effect

We therefore need to scale down s(w1,w2) for subsuming words at upper layers and to scale up s(w1,w2) for subsuming words at lower layers.

Page 27: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word similarity measure
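The formula shown on this slide was not captured in the transcript. In the paper, the word similarity measure combines the shortest path length $l$ and the subsumer depth $h$ as

    s(w_1, w_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}

with $\alpha = 0.2$ and $\beta = 0.45$ reported as optimal for WordNet. The second factor is simply $\tanh(\beta h)$: similarity decays exponentially with path length and saturates with subsumer depth.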

Page 28: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Semantic Similarity between Sentences

Page 29: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Joint word set

Since inflectional morphology may cause a word to appear in a sentence in different forms that convey a specific meaning for a specific context, we use the word form as it appears in the sentence.

Page 30: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Lexical semantic vectors (1/3)

The value of an entry of the lexical semantic vector is determined by the semantic similarity of the corresponding word to a word in the sentence.

Take T1 as an example:

The maximum similarity scores may be very low, indicating that the words are highly dissimilar.

Page 31: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Lexical semantic vectors (2/3)

We also keep all function words: although they contribute less to the meaning of a sentence than other words, they carry syntactic information that cannot be ignored when a text is very short.

We weight the significance of a word using its information content.

It has been shown that words that occur with a higher frequency (in a corpus) contain less information than those that occur with lower frequencies.

Page 32: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Lexical semantic vectors (3/3)

The information content of a word is derived from its probability in a corpus.

We ignore word sense disambiguation (WSD) here.
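Pulling slides 28–32 together, a minimal sketch: it assumes a word-level similarity function word_sim (such as the WordNet-based measure sketched earlier), raw corpus frequency counts freq with total word count total, and the 0.2 semantic threshold reported later. Sentence semantic similarity is then the cosine $S_s = s_1 \cdot s_2 / (\lVert s_1 \rVert \, \lVert s_2 \rVert)$.

    import math

    def info_content(word, freq, total):
        # I(w) = 1 - log(n + 1) / log(N + 1), as on the Brown Corpus slide.
        return 1.0 - math.log(freq.get(word, 0) + 1) / math.log(total + 1)

    def semantic_vector(sentence, joint_words, word_sim, freq, total,
                        threshold=0.2):
        vec = []
        for jw in joint_words:
            if jw in sentence:
                sim, closest = 1.0, jw   # the word itself appears in the sentence
            else:
                # most similar word in the sentence, kept only above the threshold
                sim, closest = max((word_sim(jw, w), w) for w in sentence)
                if sim <= threshold:
                    sim = 0.0
            # weight by the information content of both words
            vec.append(sim * info_content(jw, freq, total)
                           * info_content(closest, freq, total))
        return vec

    def semantic_similarity(t1, t2, word_sim, freq, total):
        joint = sorted(set(t1) | set(t2))    # the joint word set
        v1 = semantic_vector(t1, joint, word_sim, freq, total)
        v2 = semantic_vector(t2, joint, word_sim, freq, total)
        dot = sum(a * b for a, b in zip(v1, v2))
        n1 = math.sqrt(sum(a * a for a in v1))
        n2 = math.sqrt(sum(b * b for b in v2))
        return dot / (n1 * n2) if n1 and n2 else 0.0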

Page 33: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Example

Page 34: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word order similarity between sentences

Page 35: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word order similarity between sentences (1/4)

Page 36: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word order similarity between sentences (2/4)
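The formulas on this slide were not captured in the transcript. In the paper, each sentence is assigned a word order vector $r$: for each entry of the joint word set, the vector holds the position in the sentence of that word (or of its most similar word, when their similarity exceeds the word order threshold), and 0 otherwise. The word order similarity is the normalized difference of word order:

    S_r = 1 - \frac{\lVert r_1 - r_2 \rVert}{\lVert r_1 + r_2 \rVert}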

Page 37: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word order similarity between sentences (3/4)

Sr equals 1 if there is no word order difference. Sr is greater than or equal to 0 when word order differences are present.

Page 38: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Word order similarity between sentences (4/4)
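A toy illustration of these vectors, assuming exact-match word linking (the full method links the most similar word above the 0.4 word order threshold) and a hypothetical sentence pair that differs only in word order.

    import math

    def order_vector(sentence, joint_words):
        # entry i = 1-based position in the sentence of joint word i (0 if
        # absent); for repeated words this sketch uses the first occurrence
        return [sentence.index(w) + 1 if w in sentence else 0
                for w in joint_words]

    def word_order_similarity(s1, s2):
        t1, t2 = s1.lower().split(), s2.lower().split()
        joint = sorted(set(t1) | set(t2))
        r1, r2 = order_vector(t1, joint), order_vector(t2, joint)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
        summ = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
        return 1.0 - diff / summ

    # Same word set, different order: Sr drops below 1.
    print(word_order_similarity("a quick brown dog jumps over the lazy fox",
                                "a quick brown fox jumps over the lazy dog"))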

Page 39: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Example

Page 40: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Overall sentence similarity
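The combination formula on this slide was not captured in the transcript. In the paper, the overall similarity is a weighted sum of the two components,

    S(T_1, T_2) = \delta \, S_s + (1 - \delta) \, S_r

where $\delta$ is the weighting factor (0.85, as reported on the parameters slide later) that favours semantic over word order information.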

Page 41: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Example

This pair of sentences has only one co-occurring word, RAM, but the meanings of the sentences are similar. Word co-occurrence methods would produce a very low similarity measure [24], while the proposed method gives a relatively high similarity.

Page 42: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Outline

- Introduction
- Related works
- Methods
- Implementation
- Experiment
- Conclusion

Page 43: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Implementation

The databases:

- WordNet is the main semantic knowledge base for the calculation of semantic similarity.
- The Brown Corpus is used to provide information content.

Page 44: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Searching in WordNet

A search of the semantic net is performed for the shortest path length between the synsets containing the compared words and the depth of the first synset, subsuming the synsets corresponding to the compared words.

The lexical hierarchy is connected by following trails of superordinate terms in “is a”, “is a kind of” (ISA) and part-whole relations.

If the word is not in WordNet, then the search will not proceed and the word similarity is simply assigned to zero.

Page 45: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Statistics from the Brown Corpus

The probability of a word w in the corpus is computed simply as its relative frequency, where N is the total number of words in the corpus and n is the frequency of the word w in the corpus (increased by 1 to avoid presenting an undefined value to the logarithm). The information content of w in the corpus is defined from this probability, as shown below.
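Both formulas were lost in the transcript. Reconstructed from the surrounding definitions (and consistent with the paper), they are

    p(w) = \frac{n + 1}{N + 1}

    I(w) = 1 - \frac{\log(n + 1)}{\log(N + 1)}

so a word's information content decreases as its corpus frequency grows, matching the weighting rationale given earlier.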

Page 46: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Outline

- Introduction
- Related works
- Methods
- Implementation
- Experiment
  - three parameters
  - Experiment 1
  - Experiment 2
- Conclusion

Page 47: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Three parameters

- A threshold for deriving the semantic vector: we want to detect and utilize similar semantic characteristics of words to the greatest extent while keeping the noise low. This requires a semantic threshold that is small, but not too small.

- A threshold for forming the word order vector: for the word order vector to be useful, the pair of linked words (the most similar words from the two sentences) must intuitively be quite similar, as the relative ordering of less similar word pairs provides very little information.

- A factor for weighting the significance of semantic information against syntactic information: since syntax plays a subordinate role in the semantic processing of text, the semantic part is weighted higher.

We empirically found 0.4 for the word order threshold, 0.2 for the semantic threshold, and 0.85 for the weighting factor δ.

Page 48: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Experiment 1

Sentence pairs in Table 2 were selected from a variety of papers and books on natural language understanding.

Sentence 1: “bachelor” vs. “unmarried man”. As our technique compares words on a word-by-word basis, such multiple-word phrases are currently missed, although similarities are found between the word pairs bachelor-man and bachelor-unmarried.

Page 49: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Experiment 1

Sentences 6 & 14: this difference is the consequence of neglecting the multiple senses of polysemous words.

Orange is a color as well as a fruit, and is found to be more similar to another word on this basis.

Word sense disambiguation may narrow this difference; it needs to be investigated in future work.

Page 50: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Experiment 2

Participants: 32 volunteers, all native speakers of English educated to graduate level or above.

Materials: 65 noun word pairs originally measured by Rubenstein and Goodenough. Because the frequency distribution of the data exhibits a strong bias, a specific subset of 30 pairs was used, which reduces the bias in the frequency distribution.

We replaced the words with their definitions from the Collins Cobuild dictionary and substituted a phrase from the verb definition into the noun definition to form a usable sentence.

Rating: 0.0 (minimum) to 4.0 (maximum similarity); 11 (@46) sentences rated 0.0 to 0.9 were selected at equally spaced intervals from the list.

Page 51: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Page 52: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Results

Our algorithm’s similarity measure achieved a reasonably good Pearson correlation coefficient of 0.816 with the human ratings, significant at the 0.01 level.

If we take the performance of the typical human, 0.825, as the upper bound, then it is reasonable to say that our similarity measure, at 0.816, is performing well within the constraints of the experiment.

Page 53: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Outline

- Introduction
- Related works
- Methods
- Implementation
- Experiment
- Conclusion

Page 54: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Conclusion

This paper presented a method for measuring the semantic similarity between sentences or very short texts, based on semantic and word order information:

- Semantic similarity is derived from a lexical knowledge base and a corpus.
- The proposed method considers the impact of word order on sentence meaning.
- The overall sentence similarity is defined as a combination of semantic similarity and word order similarity.

Page 55: Sentence Similarity Based on Semantic Nets and Corpus Statistics


Further work

Further work will include the construction of a more varied sentence pair data set with human ratings and an improvement to the algorithm to disambiguate word sense using the surrounding words to give a little contextual information.