Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Learning Formulation and Transformation Rules for Multilingual Named Entities
Advisor: Dr. Hsu
Reporter: Chun Kai Chen
Authors: Hsin-Hsi Chen, Changhua Yang and Ying Lin
Proceedings of the ACL 2003
Outline
─ Motivation
─ Objective
─ Introduction
─ Multilingual Named Entity Corpora
─ Rule Mining
─ Experimental Results
─ Conclusions
─ Personal Opinion
Motivation
Past work on multilingual named entities has emphasized transliteration issues
However, the transformation between named entities in different languages is not transliteration only
─ Victoria Fall ⇔ 維多利亞瀑布
─ Little Rocky Mountains ⇔ 小落磯山脈
─ Kenmare ⇔ 康美爾
─ East Chicago ⇔ 東芝加哥
Objective
Propose a method to extract
─ formulation rules of named entities for individual languages
─ transformation rules for mapping among languages
Apply the results to cross-language information retrieval (CLIR)
Introduction (1/3)
In the past, named entity extraction
─ mainly focused on general domains
─ was employed in various applications such as information retrieval and question answering
Introduction (2/3)
Most of the previous approaches
─ dealt with monolingual named entity extraction
─ Chen et al. (1998) extended it to cross-language information retrieval (CLIR)
  A grapheme-based (字母, letter-level) model was proposed to compute the similarity between a Chinese transliterated name and an English name
Lin and Chen (2000) further classified the work into two directions
─ forward transliteration (Wan and Verspoor, 1998)
─ backward transliteration (Chen et al., 1998; Knight and Graehl, 1998)
─ and proposed a phoneme-based model
Introduction (3/3)
This paper studies
─ how languages and named entity types affect the choice between translation and transliteration
─ only three of the more challenging named entity types: named people, named locations, and named organizations
Multilingual Named Entity Corpora
NICT location name corpus
─ developed by the Ministry of Education of Taiwan in 1995
─ each entry consists of three parts: foreign location name, Chinese transliteration/translation name, country name
  e.g., (Victoria Fall, “維多利亞瀑布” (wei duo li ya pu bu), South Africa)
CNA personal name and organization corpora
─ used by news reporters to unify name transliteration/translation in news stories
Rule Mining
─ Frequency-Based Approach with a Bilingual Dictionary
─ Keyword Extraction without a Bilingual Dictionary
─ Extraction of Transformation Rules
─ Extraction of Keywords at a Distance
Learning Formulation and Transformation Rules
[Figure: overview of the rule-mining process, with one example per step]
─ Frequency-Based Approach with a Bilingual Dictionary: Victoria Fall gives {Victoria, “維多利亞”} and {Fall, “瀑布”}; dictionary entry “Mountain” ⇔ “山”
─ Keyword Extraction without a Bilingual Dictionary (e.g., World Taiwanese Association ⇔ “世台會”): decompose each pair, generate candidates, count the frequency (TFIDF)
  (s6) {Aletschhorn Mountain, 阿利奇赫恩山} and (s7) {Catalan Mountain, 卡太蘭山} yield {Mountain, “山” (shan)}
─ Extraction of Transformation Rules: (s6’) γ Mountain ⇔ δ 山, (s7’) γ Mountain ⇔ δ 山, (s8’) γ Strait ⇔ δ 海峽, (s9’) γ, Strait of ⇔ δ 海峽
─ Extraction of Keywords at a Distance: “American Civil Liberties Union” gives “American ∆ Liberties Union”, “American Civil ∆ Union”, “American ∆ Union”
Frequency-Based Approach with a Bilingual Dictionary
We postulate that
─ a transliterated term is usually an unknown word and is not listed in a lexicon
─ a translated term often appears in a lexicon
Under this postulation
─ a translated term (翻譯詞) occurs more often in a corpus, e.g., Fall, “瀑布”
─ a transliterated term (音譯詞) appears only a few times, e.g., Victoria, “維多利亞”
Frequency-based method (1/2)
A simple frequency-based method computes term frequencies and uses them to tell apart the transliterated and translated parts of a named entity
─ Compute the frequency of each word in the foreign name list
─ Keep the words that appear more often than a threshold and also appear in a common foreign dictionary; these words form the candidate simple keywords, e.g., Mountain
─ Examine the foreign word list again
─ Cluster the Chinese name list based on the foreign keywords; here a bilingual dictionary may be consulted, e.g., “Mountain” ⇔ “山”
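The frequency-based steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy name list, the tiny `common_words` and `bilingual` dictionaries, and the threshold value are all made up for the example.

```python
from collections import Counter

def simple_keywords(foreign_names, common_words, threshold=1):
    """Candidate keywords: words seen more than `threshold` times that are also dictionary words."""
    counts = Counter(
        word.strip(",").lower()
        for name in foreign_names
        for word in name.split()
    )
    return {w for w, n in counts.items() if n > threshold and w in common_words}

def cluster_by_keyword(pairs, keywords, bilingual):
    """Group Chinese names by foreign keyword, noting which dictionary translations they contain."""
    clusters = {}
    for foreign, chinese in pairs:
        for word in foreign.split():
            word = word.strip(",").lower()
            if word in keywords:
                found = [t for t in bilingual.get(word, []) if t in chinese]
                clusters.setdefault(word, []).append((chinese, found))
    return clusters

# Toy data (hypothetical sample built from names quoted on the slides).
pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
    ("Victoria Fall", "維多利亞瀑布"),
]
common_words = {"mountain", "strait", "fall", "of"}       # stand-in for a common dictionary
bilingual = {"mountain": ["山", "峰"], "strait": ["海峽"]}  # stand-in for a bilingual dictionary

keywords = simple_keywords([foreign for foreign, _ in pairs], common_words)
clusters = cluster_by_keyword(pairs, keywords, bilingual)
```

On this toy list only "mountain" and "strait" pass the threshold; "fall" and "of" are dictionary words but occur once each, so they are not kept.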
Frequency-based method (2/2)
NICT location name corpus
─ River (河, he), Island (島, dao), Lake (湖, hu), Mountain (山, shan), Bay (灣, wan), Mountain (峰, feng), Peak (峰, feng)
─ “Mountain” ⇔ “山” (shan) and “峰” (feng)
─ “峰” (feng) ⇔ “Mountain” and “Peak”
CNA organization name corpus
─ Suffix: Association (協會, xie hui), University (大學, da xue)
─ Prefix: International (國際, guo ji), World (世界, shi jie), American (美國, mei guo)
Keyword Extraction without a Bilingual Dictionary (problem)
Abbreviation is commonly adopted in translation, and a dictionary-based approach can hardly capture this phenomenon
─ (World Taiwanese Association, “世台會”)
Another approach that needs no dictionary is therefore proposed
Keyword Extraction without a Bilingual Dictionary (process)
(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
─ {e1, s1 s2 … st} with e1 = Aletschhorn
  {e1, 阿 利 奇 赫 恩 山}
  {e1, 阿利 利奇 奇赫 赫恩 恩山}
  {e1, 阿利奇 利奇赫 奇赫恩 赫恩山}
  {e1, 阿利奇赫 利奇赫恩 奇赫恩山}
  {e1, 阿利奇赫恩 利奇赫恩山}
  {e1, 阿利奇赫恩山}
─ {e2, s1 s2 … st} with e2 = Mountain
  {e2, 阿 利 奇 赫 恩 山}
  {e2, 阿利 利奇 奇赫 赫恩 恩山}
  {e2, 阿利奇 利奇赫 奇赫恩 赫恩山}
  {e2, 阿利奇赫 利奇赫恩 奇赫恩山}
  {e2, 阿利奇赫恩 利奇赫恩山}
  {e2, 阿利奇赫恩山}
(s7) Catalan Mountain ⇔ 卡太蘭山
─ {e1, s1 s2 … st} with e1 = Catalan
  {e1, 卡 太 蘭 山}
  {e1, 卡太 太蘭 蘭山}
  {e1, 卡太蘭 太蘭山}
  {e1, 卡太蘭山}
─ {e2, s1 s2 … st} with e2 = Mountain
  {e2, 卡 太 蘭 山}
  {e2, 卡太 太蘭 蘭山}
  {e2, 卡太蘭 太蘭山}
  {e2, 卡太蘭山}
Result
─ {e, c} whose frequency > 2 are kept
─ {Mountain, “山” (shan)}
Keyword Extraction without a Bilingual Dictionary (algorithm)
Input: pairs {Ej, Cj}
─ Ej is a foreign named entity
─ Cj is a Chinese named entity
Decompose the named entities
─ Ej comprises m words w_1·w_2…w_m; a candidate segment e_{p,q} is defined as w_p … w_q
─ Cj has n syllables s_1·s_2…s_n; a candidate segment c_{x,y} is defined as s_x … s_y
─ pairs {e_{p,q}, c_{x,y}} are obtained from {Ej, Cj}
Group and count
─ collect the pairs from the multilingual named entity list
─ count the frequency of each pair
─ pairs with higher frequency denote significant segment pairs
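The decompose, group and count steps can be sketched as below. This is a minimal sketch under assumptions, not the authors' code: the function names are hypothetical, each Chinese character is treated as one syllable, commas in the foreign name are dropped, a pair is counted once per named entity, and the cut-off `min_count` is a free parameter chosen for the toy data.

```python
from collections import Counter
from itertools import product

def segments(units):
    """All contiguous segments units[p..q] of a sequence, as tuples."""
    n = len(units)
    return [tuple(units[p:q]) for p in range(n) for q in range(p + 1, n + 1)]

def candidate_pairs(foreign, chinese):
    """Every {e_pq, c_xy} pair for one {Ej, Cj}; a set, so each pair counts once per entity."""
    words = foreign.replace(",", " ").split()
    syllables = list(chinese)  # assumption: one character per syllable
    return {
        (" ".join(e), "".join(c))
        for e, c in product(segments(words), segments(syllables))
    }

def count_pairs(entity_pairs):
    """Frequency of each candidate segment pair over the whole entity list."""
    counts = Counter()
    for foreign, chinese in entity_pairs:
        counts.update(candidate_pairs(foreign, chinese))
    return counts

entities = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),  # (s6)
    ("Catalan Mountain", "卡太蘭山"),          # (s7)
    ("Cook Strait", "科克海峽"),               # (s8)
    ("Dover, Strait of", "多佛海峽"),          # (s9)
]
counts = count_pairs(entities)
min_count = 2  # hypothetical cut-off for this four-entity sample
kept = {pair for pair, n in counts.items() if n >= min_count}
```

On these four entities the kept pairs are {Mountain, 山}, {Strait, 海峽}, {Strait, 海} and {Strait, 峽}. The last two are sub-segments of 海峽, which is the kind of redundancy a scoring step has to remove.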
Keyword Extraction without a Bilingual Dictionary (example)
(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
(s7) Catalan Mountain ⇔ 卡太蘭山
(s8) Cook Strait ⇔ 科克海峽
(s9) Dover, Strait of ⇔ 多佛海峽
─ All the pairs {e, c} whose frequency > 2 are kept
─ {Mountain, “山” (shan)} and {Strait, “海峽” (hai xia)} appear twice
Keyword Extraction without a Bilingual Dictionary (problem)
Two issues have to be addressed
─ redundancy that may exist in the pairs of segments should be eliminated carefully
─ e may be translated into more than one synonym, e.g., “Association” ⇔ “協會” (xie hui) and “聯誼會” (lian yi hui)
A metric to deal with the above issues is proposed
$$\mathrm{score}(\{e, c_i\}) = f(\{e, c_i\}) \times \mathrm{idf}(c_i) \times \log_2(|c_i| + 1)$$
$$f(\{e, c_i\}) = \frac{\mathrm{tf}\{e, c_i\}}{\max_j \mathrm{tf}\{e, c_j\}}$$
$$\mathrm{idf}(c_i) = \log_2 \frac{N}{\mathrm{df}(c_i)}$$
$$c^{*} = \arg\max_{c_i} \mathrm{score}(\{e, c_i\})$$
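A small sketch of how such a score could pick one Chinese segment for a foreign keyword. It assumes the metric reads score = f × idf × log2(|c| + 1), with f the term frequency normalised by the maximum over the candidates, idf = log2(N / df), and |c| the length of the Chinese segment in characters. The slide does not define N and df, so reading them as the number of named entities and the number of Chinese names containing c is an assumption, as are the function name and the toy counts.

```python
import math

def score_candidates(tf, df, n_entities):
    """Score each Chinese candidate c for one foreign segment e.

    tf: {c: tf{e, c}} co-occurrence counts with e
    df: {c: number of Chinese names containing c} (assumed meaning of df)
    n_entities: total number of named entities (assumed meaning of N)
    """
    max_tf = max(tf.values())
    scores = {}
    for c, count in tf.items():
        f = count / max_tf
        idf = math.log2(n_entities / df[c])
        scores[c] = f * idf * math.log2(len(c) + 1)
    return scores

# Toy counts for e = "Strait" over the four entities (s6)-(s9).
tf = {"海": 2, "峽": 2, "海峽": 2}
df = {"海": 2, "峽": 2, "海峽": 2}
scores = score_candidates(tf, df, n_entities=4)
best = max(scores, key=scores.get)
```

With equal frequencies the length factor is what separates the candidates here: the two-character segment 海峽 outscores its one-character sub-segments.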
Extraction of Transformation Rules
Chinese location name keyword
─ tends to be located at the rightmost position
─ the remaining part is a transliterated name
Foreign location name keyword
─ tends to be either located at the rightmost position, or permuted with some prepositions, a comma, and the transliterated part
Examples
(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
(s7) Catalan Mountain ⇔ 卡太蘭山
(s8) Cook Strait ⇔ 科克海峽
(s9) Dover, Strait of ⇔ 多佛海峽
Extracted rules
(s6’) γ Mountain ⇔ δ 山
(s7’) γ Mountain ⇔ δ 山
(s8’) γ Strait ⇔ δ 海峽
(s9’) γ, Strait of ⇔ δ 海峽
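One way to derive such rules from a name pair and a known keyword pair is sketched below. This is an illustrative assumption rather than the procedure in the paper: the function name, the `FUNCTION_WORDS` list, and the handling of single-word keywords only are all choices made for the example. Every remaining foreign word collapses to γ, and the remaining Chinese part to δ.

```python
import re
from collections import Counter

FUNCTION_WORDS = {"of", "for", "de"}  # hypothetical list of words kept verbatim

def to_rule(foreign, chinese, kw_foreign, kw_chinese):
    """Turn one name pair into a (foreign pattern, Chinese pattern) rule."""
    tokens = re.findall(r"\w+|[^\w\s]", foreign)  # words and punctuation marks
    out = []
    for tok in tokens:
        keep = (
            tok.lower() == kw_foreign.lower()
            or tok.lower() in FUNCTION_WORDS
            or not tok.isalnum()  # punctuation such as the comma
        )
        if keep:
            out.append(tok)
        elif not out or out[-1] != "γ":
            out.append("γ")  # collapse a run of transliterated words
    foreign_rule = " ".join(out).replace(" ,", ",")

    left, found, right = chinese.rpartition(kw_chinese)
    if not found:
        raise ValueError("Chinese keyword not present in the Chinese name")
    parts = ["δ" if left else "", found, "δ" if right else ""]
    chinese_rule = " ".join(p for p in parts if p)
    return foreign_rule, chinese_rule

examples = [
    ("Aletschhorn Mountain", "阿利奇赫恩山", "Mountain", "山"),  # (s6)
    ("Catalan Mountain", "卡太蘭山", "Mountain", "山"),          # (s7)
    ("Cook Strait", "科克海峽", "Strait", "海峽"),               # (s8)
    ("Dover, Strait of", "多佛海峽", "Strait", "海峽"),          # (s9)
]
rules = Counter(to_rule(*example) for example in examples)
```

Counting identical rules, as `rules` does, gives the per-rule frequencies; multi-word keywords would need an extra matching step that this sketch leaves out.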
Extraction of Keywords at a Distance
(s12) and (s13)
─ the English compound keyword is separated, and so is its Chinese counterpart
(s14) and (s15)
─ the English compound keyword is connected
─ but the corresponding Chinese translation is separated
Examples
(s12) American Podiatric Medical Association ⇔ 美國足病醫療學會
(s13) American Public Health Association ⇔ 美國公共衛生學會
(s14) American Society for Industrial Security ⇔ 美國工業安全協會
(s15) American Society of Newspaper Editors ⇔ 美國報紙編輯人協會
Extraction of Keywords at a Distance
Introduce a symbol ∆ to cope with the distance issue
─ “American Civil Liberties Union”
─ “American ∆ Liberties Union”
─ “American Civil ∆ Union”
─ “American ∆ Union”
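The ∆ patterns listed for “American Civil Liberties Union” can be generated mechanically. The sketch below assumes that ∆ replaces any contiguous run of interior words while the first and last words stay in place; that assumption reproduces the three patterns on the slide, but the slide does not state the general rule, and the function name is hypothetical.

```python
def gapped_patterns(name, gap="∆"):
    """Replace each contiguous run of interior words with a single gap symbol."""
    words = name.split()
    patterns = []
    for start in range(1, len(words) - 1):       # first word is always kept
        for end in range(start + 1, len(words)):  # last word is always kept
            patterns.append(" ".join(words[:start] + [gap] + words[end:]))
    return patterns

patterns = gapped_patterns("American Civil Liberties Union")
```

Each pattern can then be counted like any other candidate segment, so that “American ∆ Union” collects evidence from names whose middle words differ.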
Experimental Analysis (corpus)
NICT location corpus
─ a total of 122 keyword pairs are identified
─ a total of 230 transformation rules
─ on average, a keyword pair corresponds to 1.89 transformation rules
CNA personal names
─ names composed of more than one word: 100 / 50,586
─ only a few keywords are extracted
  De ⇔ 戴 (dai), La ⇔ 拉 (la), De La ⇔ 戴拉 (dai la), Du ⇔ 杜 (du), David ⇔ 大衛 (da wei)
CNA organization
─ names composed of more than one word: 12,885 / 14,658
─ 5,229 keyword pairs are extracted
─ most of the keyword pairs are meaning translations
Experimental Analysis (classify)
We classify these keyword pairs into the following types
(1) Meaning translation
─ common location keywords: Bir ⇔ 井 (jing), Ain ⇔ 泉 (quan), Bahr ⇔ 河 (he), Cerro ⇔ 山 (shan)
─ direction (e.g., Central ⇔ 中 (zhong), East ⇔ 東 (dong), etc.), size (e.g., Big ⇔ 大 (da)), length (e.g., Long ⇔ 長 (chang)), color (e.g., Black ⇔ 黑 (hei), Blue ⇔ 藍 (lan), etc.)
─ the specificity of place or area: Crystal ⇔ 結晶, Diamond ⇔ 鑽石 (zuan shi)
(2) Phoneme transliteration keywords
─ Dera ⇔ 德拉 (de la), Monte ⇔ 蒙特 (meng te), Los ⇔ 洛斯 (luo si), 伊利莎白 (yi li sha bai), Edward ⇔ 愛德華 (ai de hua)
─ a total of 39 terms (31.97%) belong to this type
(3) Some keywords in type (1) are transliterated
─ Bay ⇔ 貝 (bei), Beach ⇔ 比奇 (bi qi)
─ a total of 14 keywords (11.48%) are extracted
Experimental Results
NICT location corpus
─ a total of 122 keyword pairs are identified
─ a total of 230 transformation rules
─ on average, a keyword pair corresponds to 1.89 transformation rules
Keyword pair mountain ⇔ 山 (shan)
─ four transformation rules, with α = mountain and β = 山 (counts in parentheses)
  (1) γα ⇔ δβ (234)
  (2) γ, α ⇔ δβ (45)
  (3) γ, αγ ⇔ δβ (1)
  (4) γαγ ⇔ δβ (1)
Application on CLIR
Conclusion and Remarks
This paper proposes corpus-based approaches
─ to extract the formulation rules and the translation/transliteration rules among multilingual named entities
Two types of evaluation
─ partitioning the corpora into two parts, one for training and the other for testing
─ integrating our method into a cross-language information retrieval system
Further applications
─ will be explored in the future, and the methodology will be extended to other types of named entities
Personal Opinion
Drawback
─ lacks an analysis of time complexity
Application
─ construct Chinese-English rules and apply them to IR
Future Work
─ take up the transliterated / translated term issue