Measuring Semantic Similarity between Words Using HowNet
ICCSIT 2008
Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU
School of Computer Science, Beijing Institute of Technology
HowNet
• W_C=工夫
• DEF={Ability|能力:host={human|人}}
• DEF={Strength|力量:host={group|群體}{human|人}}
• DEF={time|時間}
• Word: 工夫
• Concept: {Ability|能力:host={human|人}}
• Sememe: Ability|能力
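The word, concept and sememe levels can be illustrated with a small Python sketch that pulls the sememes out of a DEF string. The regular expression and the function names `sememes` and `primary_sememe` are assumptions made for illustration; they are not part of HowNet or its API, and the sketch treats the first sememe in a DEF as the primary one.

```python
import re

# Assumed pattern: a sememe is "English|中文" right after an opening brace.
SEMEME = re.compile(r"\{\s*([A-Za-z]+)\|([^\s:{},]+)")

def sememes(definition):
    """Return the sememes of one DEF string, in order of appearance."""
    return [f"{en}|{zh}" for en, zh in SEMEME.findall(definition)]

def primary_sememe(definition):
    """Return the first sememe of a DEF, or None if it has none."""
    found = sememes(definition)
    return found[0] if found else None

# The three concepts (DEFs) of the word 工夫 from the slide.
concepts = [
    "{Ability|能力:host={human|人}}",
    "{Strength|力量:host={group|群體}{human|人}}",
    "{time|時間}",
]

for definition in concepts:
    print(primary_sememe(definition), sememes(definition))
```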
Algorithms
• Similarity between sememes
• Similarity between concepts
• Similarity between words
• Amendment with thesaurus
Similarity between sememes
• Strategy 1
• Strategy 2
• d: Distance between S1 and S2
• h: Depth of the first common parent node of the two sememes
• α, β: Parameters to adjust d, h
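The two quantities d and h can be computed from a sememe hierarchy stored as a child-to-parent map. The sketch below is a minimal illustration under stated assumptions: the toy hierarchy is hypothetical and not taken from HowNet, the root is assigned depth 0, distance is counted in edges, and a sememe counts as its own common parent when compared with itself. It does not reproduce the Strategy 1 or Strategy 2 formulas.

```python
def path_to_root(node, parent):
    """List the node, its parent, and so on up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def distance_and_depth(s1, s2, parent):
    """Return (d, h) for two sememes in the same tree.

    d is the number of edges on the path between s1 and s2;
    h is the depth of their first common parent (root depth = 0).
    """
    p1 = path_to_root(s1, parent)
    p2 = path_to_root(s2, parent)
    steps_from_s2 = {node: i for i, node in enumerate(p2)}
    for i, node in enumerate(p1):
        if node in steps_from_s2:
            d = i + steps_from_s2[node]
            h = len(path_to_root(node, parent)) - 1
            return d, h
    raise ValueError("sememes are not in the same tree")

# Hypothetical toy hierarchy (child -> parent), for illustration only.
parent = {
    "animate": "entity",
    "human": "animate",
    "animal": "animate",
    "time": "entity",
}

print(distance_and_depth("human", "animal", parent))
print(distance_and_depth("human", "time", parent))
```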
Similarity between concepts
• Word “Doctor”
• DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}
• Human → Primary sememe
• Status, own … → Modifying sememe
• Possession, domain … → Descriptors
Similarity between concepts
• P, Q: Two concepts. Assume P has fewer modifying sememes.
• P_i, Q_j: The ith and jth modifying sememes of P and Q.
• S, T: Descriptor sets of P and Q
• α, β, γ: Weights of the 3 parts
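The equation for this slide is not reproduced in the text, so the sketch below is only one plausible reading of the symbols, not the authors' formula. It assumes the three parts are the primary sememes, the modifying sememes and the descriptor sets, and that they are combined as a weighted sum. The best-match averaging of modifying sememes, the set-overlap score for descriptors, and the dictionary layout of a concept are all assumptions made for illustration. The default weights are the values reported on the evaluation slide.

```python
def concept_similarity(p, q, sememe_sim, alpha=0.54, beta=0.36, gamma=0.1):
    """Weighted sum of three partial similarities (assumed form)."""
    # Make P the concept with fewer modifying sememes, as on the slide.
    if len(p["modifiers"]) > len(q["modifiers"]):
        p, q = q, p

    # Part 1: similarity of the primary sememes.
    primary = sememe_sim(p["primary"], q["primary"])

    # Part 2 (assumption): match each P_i to its most similar Q_j and average.
    if p["modifiers"]:
        modifying = sum(
            max(sememe_sim(pi, qj) for qj in q["modifiers"])
            for pi in p["modifiers"]
        ) / len(p["modifiers"])
    else:
        modifying = 1.0 if not q["modifiers"] else 0.0

    # Part 3 (assumption): overlap of the descriptor sets S and T.
    s, t = set(p["descriptors"]), set(q["descriptors"])
    descriptors = len(s & t) / len(s | t) if (s | t) else 1.0

    return alpha * primary + beta * modifying + gamma * descriptors

def exact_match(a, b):
    """Stand-in sememe similarity for the demo: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0

doctor = {
    "primary": "human",
    "modifiers": ["own", "Status", "education", "HighRank", "most"],
    "descriptors": ["possession", "domain", "modifier", "degree", "possessor"],
}
person = {"primary": "human", "modifiers": [], "descriptors": []}

print(concept_similarity(doctor, person, exact_match))
```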
Similarity between words
• One word may have many concepts.
• Choose the most similar pair of concepts.
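Choosing the most similar pair amounts to taking the maximum concept similarity over all pairs of concepts of the two words. A minimal sketch follows; the function name `word_similarity`, the `concept_sim` callable and the 0.0 fallback for a word with no concepts are assumptions, and the toy similarity function is only a stand-in.

```python
from itertools import product

def word_similarity(concepts1, concepts2, concept_sim):
    """Return the best similarity over all concept pairs of two words.

    Falls back to 0.0 (an assumption) when either word has no concepts.
    """
    return max(
        (concept_sim(p, q) for p, q in product(concepts1, concepts2)),
        default=0.0,
    )

def toy_concept_sim(p, q):
    """Stand-in concept similarity: 1.0 if identical, else 0.2."""
    return 1.0 if p == q else 0.2

# Concepts are represented here by their primary sememe only.
word_a = ["Ability", "Strength", "time"]
word_b = ["time", "space"]

print(word_similarity(word_a, word_b, toy_concept_sim))
```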
Amendment with thesaurus
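• Some words are missing from HowNet, and some of its DEFs are too coarse.
• Use the Chinese thesaurus Tongyici Cilin (同義詞詞林). Presenter's note: this should be the Tongyici Cilin (Extended) edition from the HIT Information Retrieval Lab (IR-Lab), Harbin Institute of Technology.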
• d : Distance between W1 and W2
Similarity between words
• Sim1: Eq. 6 (Similarity in HowNet)
• Sim2: Eq. 7 (Similarity in Tongyici Cilin)
• α, β, γ, η: Parameters to scale the weights of the two parts.
Evaluation
• Dataset
  – RG-65
    • Rubenstein and Goodenough established synonymy judgments for 65 pairs of nouns. They invited 51 human judges to assign every pair a score between 0.0 and 4.0 to indicate semantic similarity.
  – MC-28
    • Miller and Charles followed this idea and restricted themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally among words with high, intermediate and low similarity.
• To measure similarity between Chinese words, RG-65 was translated into Chinese manually.
Evaluation
• Parameters
  – Similarity between sememes
    • Strategy 1: α = 1.6, β = 0.16
    • Strategy 2: α = 0.2, β = 0.16
  – Similarity between concepts
    • α = 0.54, β = 0.36, γ = 0.1
  – Similarity between words
    • On Chinese dataset: α = 0.95, β = 0.05, γ = 0.95, η = 0.05
    • On English dataset: α = 0.95, β = 0.05, γ = 0.45, η = 0.55
Result
– HAPI: HowNet_Get_Concept_Similarity in the HowNet API
Result
• In addition, the authors compare their results with eight groups of measures that rely on WordNet.
• Table 1. Correlation coefficients of the algorithms

Approach        RG-28    MC-28    RG-65
Hirst-St.Onge   0.671    0.682    0.732
Jiang           0.67     0.682    0.732
Leacock         0.801    0.82     0.852
Lin             0.773    0.814    0.834
Resnik          0.706    0.763    0.8
Yang            0.889    0.921    0.897
Li              0.8914   0.882    N/A
Alvarez         0.9      0.913    N/A
S1-English      0.9238   0.9074   0.8764
S2-English      0.9286   0.9056   0.8744
HAPI-English    0.5371   0.5113   0.6089
S1-Chinese      0.8617   0.8401   0.8958
S2-Chinese      0.8679   0.846    0.895
HAPI-Chinese    0.5328   0.5001   0.6752
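Each entry in Table 1 is a correlation between an algorithm's scores and the human judgments on a dataset. The slides do not say which coefficient is used; the sketch below assumes the Pearson correlation coefficient. The sample scores are hypothetical and are not taken from RG-65 or MC-28.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: human ratings on the 0.0-4.0 scale and the
# similarity scores an algorithm assigned to the same word pairs.
human_ratings = [3.9, 3.1, 1.2, 0.4]
algorithm_scores = [0.95, 0.70, 0.30, 0.05]

print(round(pearson(human_ratings, algorithm_scores), 4))
```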
RG-65
MC-30 & RG-30