Intelligent Database Systems Lab
National Yunlin University of Science and Technology (國立雲林科技大學)
Translation of Web Queries Using Anchor Text Mining
Advisor : Dr. Hsu
Graduate : Wen-Hsiang Hu
Authors : Wen-Hsiang Lu
ACM, June 2002
Outline
- Motivation
- Objective
- Introduction
- Anchor Text Mining
- Probabilistic Inference Model
- Query Translation System
- Experiments
- Discussion
- Conclusion
- Personal Opinion
Motivation
- One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names.
Objective
- Automatically extract translations of Web query terms.
Introduction
- The paper is interested in discovering translations of new terminology and proper names by mining Web anchor texts.
- Problems of previous research methods:
  - dependence on parallel corpora for various subjects and multiple languages
  - lack of parallel correlation between word pairs
  - short query terms
[Figure: anchor texts of links pointing to the Yahoo page, e.g. "Yahoo", "雅虎" (Yahoo), "美國雅虎" (Yahoo US), "搜尋、雅虎" (search, Yahoo)]
Anchor Text Mining
- We use a triple <Uj, Ui, Dk> to indicate that page Uj points to page Ui with description text Dk.
- For a Web page (or URL) Ui, its anchor-text set AT(Ui) is defined as all of the anchor texts of the links pointing to Ui, i.e., Ui's in-links.
- For a query term appearing in AT(Ui), it is likely that its corresponding translations also appear there.
[Figure: several pages Uj linking to page Ui]
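The definition of AT(Ui) can be illustrated with a minimal sketch that groups anchor texts by the page they point to. The function name `build_anchor_text_sets` and the example URLs are hypothetical, chosen only for illustration; the slides do not describe how the triples were stored.

```python
from collections import defaultdict


def build_anchor_text_sets(triples):
    """Group anchor texts by target page.

    Each triple is (source_url, target_url, anchor_text),
    mirroring <Uj, Ui, Dk> from the slide.
    """
    anchor_sets = defaultdict(list)
    for source_url, target_url, anchor_text in triples:
        anchor_sets[target_url].append(anchor_text)
    return dict(anchor_sets)


# Hypothetical link triples, for illustration only.
triples = [
    ("http://a.example/", "http://www.yahoo.com/", "Yahoo"),
    ("http://b.example/", "http://www.yahoo.com/", "雅虎"),
    ("http://c.example/", "http://www.yahoo.com/", "美國雅虎"),
    ("http://c.example/", "http://d.example/", "other page"),
]
anchor_sets = build_anchor_text_sets(triples)
```

Here `anchor_sets["http://www.yahoo.com/"]` plays the role of AT(Ui) for the Yahoo page: an English term and its Chinese translations end up in the same set because they describe the same target.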
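Probabilistic Inference Model
- An asymmetric similarity estimation model may cause some common terms to become the best translations.
- A symmetric similarity estimation function is used instead, based on the probabilistic inference model defined below, where Ts is the source term and Tt is the target translation.

Asymmetric model, the inductive rule "if Ts then Tt", i.e. P(Ts→Tt):
$$P(T_s \rightarrow T_t) = P(T_t \mid T_s) = \frac{P(T_s \cap T_t)}{P(T_s)}$$

Symmetric model, the inductive rules "if Ts then Tt" and "if Tt then Ts", i.e. P(Ts↔Tt):
$$P(T_s \leftrightarrow T_t) = \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} = \frac{P(T_s \cap T_t)}{P(T_s) + P(T_t) - P(T_s \cap T_t)} \qquad (2)$$

Example: out of 100 anchor texts in total, Ts = "Yahoo" appears in only one and Tt = "雅虎" (Yahoo) appears in 10, e.g. "雅虎 Yahoo", "雅虎 動物" (Yahoo, animals), "雅虎 企業" (Yahoo, enterprises), ...
- P(Tt | Ts) = 0.01 / 0.01 = 1
- P(Ts↔Tt) = 0.01 / [(0.01 + 0.1) - 0.01] = 0.1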
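The arithmetic of the Yahoo example can be checked with a short sketch. The function names are hypothetical; the formulas are conditional probability for the rule "if Ts then Tt" and intersection over union for the two-way rule, which is what the numbers on the slide correspond to.

```python
def asymmetric_similarity(p_source, p_joint):
    """P(Tt | Ts): the one-way rule "if Ts then Tt"."""
    return p_joint / p_source


def symmetric_similarity(p_source, p_target, p_joint):
    """P(Ts <-> Tt): joint probability divided by the probability of the union."""
    return p_joint / (p_source + p_target - p_joint)


# Numbers from the slide: 100 anchor texts in total, "Yahoo" in 1 of them,
# "雅虎" in 10 of them, and both together in 1.
total = 100
p_source = 1 / total
p_target = 10 / total
p_joint = 1 / total

asym = asymmetric_similarity(p_source, p_joint)
sym = symmetric_similarity(p_source, p_target, p_joint)
```

With these inputs `asym` is 1 and `sym` is approximately 0.1, matching the two values on the slide.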
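Probabilistic Inference Model (cont.)
Let U = (U1, U2, …, Un) be a concept space (Web page space) consisting of a set of pairwise disjoint basic concepts (Web pages), i.e., Ui ∩ Uj = ∅ for i ≠ j. We can rewrite Eq. (2) as follows:
$$P(T_s \leftrightarrow T_t) = \frac{\sum_{i=1}^{n} P(T_s \cap T_t \mid U_i)\, P(U_i)}{\sum_{i=1}^{n} P(T_s \cup T_t \mid U_i)\, P(U_i)}, \qquad P(U_i) = \frac{L(U_i)}{\sum_{j=1}^{n} L(U_j)}$$
where L(Uj) is the number of in-links of page Uj.
[Figure: page Ui with in-links from pages Uj, annotated with L(Ui)]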
Probabilistic Inference Model (cont.)
- We assume that Ts and Tt are independent given Ui; the joint probability P(Ts∩Tt|Ui) is then equal to the product of P(Ts|Ui) and P(Tt|Ui).
- This estimation approach considers the link information and the degree of authority among Web pages.
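A minimal sketch of how a link-weighted symmetric score could be computed under the independence assumption. Several choices here are assumptions and not taken from the slides: P(Ui) is taken proportional to the number of in-links L(Ui), P(T|Ui) is estimated as the fraction of Ui's anchor texts that contain T (plain substring match), and all function names are hypothetical. Under independence, P(Ts∪Tt|Ui) = P(Ts|Ui) + P(Tt|Ui) - P(Ts|Ui)P(Tt|Ui), which is what the denominator uses.

```python
def page_prior(anchor_sets):
    """P(Ui), assumed proportional to the number of in-links of Ui."""
    total = sum(len(texts) for texts in anchor_sets.values())
    return {url: len(texts) / total for url, texts in anchor_sets.items()}


def term_prob_given_page(term, anchor_texts):
    """P(T | Ui), estimated as the share of Ui's anchor texts containing T."""
    if not anchor_texts:
        return 0.0
    return sum(term in text for text in anchor_texts) / len(anchor_texts)


def symmetric_link_similarity(source, target, anchor_sets):
    """Link-weighted symmetric score, summed over all pages Ui."""
    prior = page_prior(anchor_sets)
    numerator = 0.0
    denominator = 0.0
    for url, texts in anchor_sets.items():
        p_s = term_prob_given_page(source, texts)
        p_t = term_prob_given_page(target, texts)
        joint = p_s * p_t  # independence of Ts and Tt given Ui
        numerator += joint * prior[url]
        denominator += (p_s + p_t - joint) * prior[url]
    return numerator / denominator if denominator else 0.0


# Hypothetical anchor-text sets, for illustration only.
anchor_sets = {
    "yahoo": ["Yahoo", "雅虎", "Yahoo 雅虎", "美國雅虎"],
    "other": ["動物", "雅虎 動物"],
}
score_good = symmetric_link_similarity("Yahoo", "雅虎", anchor_sets)
score_bad = symmetric_link_similarity("Yahoo", "動物", anchor_sets)
```

Pages with more in-links contribute more to both sums, which is one way to read the remark about link information and authority.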
Query Translation System
Three different methods to extract Chinese terms:
- PAT-tree-based
  1. check whether the strings of candidate terms are complete within a lexical boundary
  2. decide the importance of a term based on its relative frequency
- Query-set-based: take queries from search engines, using query sets of different sizes
- Tagger-based: use the CKIP tagger to extract unknown words
[Figure: Chinese terms extracted from the anchor texts of the Yahoo page, e.g. "雅虎" (Yahoo), "搜尋、雅虎" (search, Yahoo), "美國雅虎" (Yahoo US)]
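The query-set-based method can be pictured as matching a known set of query terms against anchor texts. The sketch below is only one plausible reading of the slide: the function name, the substring matching, and the tiny query set are all assumptions made for illustration.

```python
from collections import Counter


def extract_candidates(anchor_texts, query_set):
    """Count, for each known query term, how many anchor texts contain it."""
    counts = Counter()
    for text in anchor_texts:
        for term in query_set:
            if term in text:
                counts[term] += 1
    return counts


# Anchor texts from the Yahoo figure; the query set is hypothetical.
anchor_texts = ["Yahoo 雅虎", "美國雅虎", "搜尋、雅虎", "雅虎"]
query_set = {"雅虎", "美國雅虎", "搜尋"}
candidates = extract_candidates(anchor_texts, query_set)
```

Only terms present in the query set can be extracted, so the size of the query set limits which candidates are available.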
Experiments
Experimental Environment
- Popular query terms were collected from the logs of Dreamer and GAIS.
- These query terms were taken as the major test set in the term translation extraction analysis.
- Terms that had no corresponding Chinese translations in the anchor-text database were filtered out, and 622 English terms were picked as the source query set.
Experiments (cont.)
Evaluation Metric
- For a set of test query terms, the top-n inclusion rate is defined as the percentage of the query terms whose effective translation(s) can be found in the top n extracted translations.
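The top-n inclusion rate can be sketched as follows. The function name and the toy data are hypothetical; `ranked` maps each query term to its extracted translations in rank order, and `gold` maps each term to its set of effective translations.

```python
def top_n_inclusion_rate(ranked, gold, n):
    """Percentage of query terms with an effective translation in the top n."""
    if not ranked:
        return 0.0
    hits = 0
    for term, candidates in ranked.items():
        effective = gold.get(term, set())
        if any(candidate in effective for candidate in candidates[:n]):
            hits += 1
    return 100.0 * hits / len(ranked)


# Hypothetical extraction results and reference translations.
ranked = {
    "yahoo": ["雅虎", "搜尋"],
    "sakura": ["蜘蛛網", "櫻花"],
    "term3": ["候選甲"],
}
gold = {
    "yahoo": {"雅虎"},
    "sakura": {"櫻花"},
    "term3": {"候選乙"},
}
top1 = top_n_inclusion_rate(ranked, gold, 1)
top2 = top_n_inclusion_rate(ranked, gold, 2)
```

In this toy data one of three terms has an effective translation at rank 1 and two of three within the top 2, so the rate grows with n.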
Experiments (cont.)
Performance with Various Similarity Estimation Models
- MA: asymmetric model, P(Ts→Tt)
- MAL: asymmetric model with link information
- MS: symmetric model, P(Ts↔Tt)
- MSL: symmetric model with link information (the proposed model)
Test setting: 622 English query terms and the query-set-based method.
Experiments (cont.)
Performance with Various Term Extraction Methods (using MSL as the similarity estimation model)

                      PAT-tree-based   Query-set-based   Tagger-based
longer translations   ○                ○                 X
short translations    ○                ○
low-frequency         X                ○                 ○
Experiments (cont.)
Performance with Various Query-Set Sizes
- The medium-sized query set achieved the best performance.
- Example: "sakura"
  - 9,709 terms: 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)
  - 228,566 terms: 庫洛魔法使 (Card Captor Sakura); 櫻花建設 (Sakura Development Corporation); 模仿 (imitation); 櫻花大戰 (Sakura Wars); 美夕 (Miyu, name of an actress); 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)
- A larger query set might also produce more noise.
Discussion
- Comparisons with a translation lexicon
- Queries suitable for finding translations
- Extracting domain-specific translations
- Experiments on Simplified Chinese pages
Conclusion
- The paper proposes a new and effective approach for mining Web link structures and anchor texts for translations of Web query terms.
- Future research: combining more in-depth linguistic knowledge to remove noisy terms.
Personal Opinion
……..