Combining Query Translation and Document Translation in Cross-Language Retrieval
Aitao Chen & Fredric C. Gey*
School of Information Management and Systems
*UC Data Archive & Technical Assistance
University of California at Berkeley
CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway
Talk Outline
• Development of new resources
• Fast approximate document translation
• Combining query translation and document translation
• Conclusions
New Resources
• Finnish and Swedish stoplists
• Base Finnish and Swedish lexicons for decompounding
• Statistical translation lexicons derived from parallel texts
• Finnish and Swedish statistical stemmers automatically generated from parallel texts
• English spelling normalizer
Development of Swedish Stoplist (by someone who doesn’t know Swedish)
In Swedish textbooks written in English (e.g., grammars), look for Swedish words whose English translations are English stopwords.
• en park (a park)
• ett piano (a piano)
• Jag vet inte mycket om honom (I don’t know much about him)
• efter skolan (after school)
• Hans och Greta (Hans and Greta)
(Source: Swedish: A comprehensive grammar by P. Holmes & I. Hinchliffe)
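The lookup described above can be sketched in a few lines. This is only an illustration, assuming the word-level glosses have already been read off the textbook examples by hand; the `glosses` table, the small `english_stopwords` set, and the function name `derive_stoplist` are hypothetical and are not the resources used in the talk.

```python
# Hand-made word glosses read off the textbook examples (illustrative only).
glosses = {
    "en": "a", "ett": "a", "park": "park", "piano": "piano",
    "jag": "i", "vet": "know", "inte": "not", "mycket": "much",
    "om": "about", "honom": "him", "efter": "after", "skolan": "school",
    "och": "and",
}

# A tiny English stoplist, assumed for this sketch.
english_stopwords = {"a", "i", "not", "much", "about", "him", "after", "and"}


def derive_stoplist(glosses, english_stopwords):
    """Keep the source words whose English gloss is an English stopword."""
    return sorted(w for w, g in glosses.items() if g in english_stopwords)


swedish_stoplist = derive_stoplist(glosses, english_stopwords)
print(swedish_stoplist)
```

Content words such as "park" and "skolan" are dropped because their glosses are not English stopwords.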
Development of Swedish Base Lexicon
A base lexicon should contain all and only the words and their variants that are not compounds.
• Compile a list of Swedish words (e.g., from the Swedish document collection).
• Remove the words that are 4 or fewer characters long.
• Remove the long words that can be decomposed into short words in the initial wordlist.
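Initial wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus
Decomposed compounds:
• datoranimation: dator animation
• datorgrafik: dator grafik
• datorteknologi: dator teknologi
• datorvirus: dator virus
Remove the compounds that are decomposed.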
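The three steps above can be sketched as follows. This is a minimal sketch under simplifying assumptions: a compound is taken to be an exact concatenation of two or more words from the wordlist, with no linking elements, and the names `can_split` and `build_base_lexicon` are made up for the example.

```python
def can_split(word, lexicon, min_parts=2):
    """Return True if `word` is a concatenation of >= min_parts lexicon words."""
    n = len(word)
    # best[i] = largest number of lexicon words that exactly cover word[:i],
    # or -1 if word[:i] cannot be covered at all.
    best = [-1] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] >= 0 and word[start:end] in lexicon:
                best[end] = max(best[end], best[start] + 1)
    return best[n] >= min_parts


def build_base_lexicon(words, min_len=5):
    """Drop short words, then drop words that decompose into other list words."""
    candidates = {w for w in words if len(w) >= min_len}
    return sorted(w for w in candidates if not can_split(w, candidates))


wordlist = [
    "animation", "animationen", "dator", "datoranimation", "datorgrafik",
    "datorteknologi", "datorvirus", "grafik", "teknologi", "virus",
]
base_lexicon = build_base_lexicon(wordlist)
print(base_lexicon)
```

On the slide's wordlist this keeps animation, animationen, dator, grafik, teknologi and virus. The variant "animationen" survives because "en" is too short to be in the candidate list.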
Development of Statistical Translation Lexicons from Parallel Texts
parallel texts (EU Official Journal) → PDF-to-text conversion → paragraph & sentence alignment → statistical MT toolkit / statistical association → statistical translation lexicons
Language pairs:
• 1. English-Dutch 2. English-Finnish 3. English-Swedish 4. Dutch-English 5. Finnish-English 6. Swedish-English
• 1. Italian-Spanish 2. German-Italian 3. Finnish-German
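As an illustration of the "statistical association" route, the sketch below scores word pairs from sentence-aligned text with the Dice coefficient. The slides do not name the association measure, so Dice is an assumption made for this example, and the three aligned pairs are toy data rather than EU Official Journal text.

```python
from collections import Counter
from itertools import product


def dice_lexicon(aligned_pairs):
    """Score source/target word pairs by Dice over aligned sentence pairs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        src_words = set(src_sent.split())
        tgt_words = set(tgt_sent.split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        pair_count.update(product(src_words, tgt_words))  # co-occurrences
    scores = {}
    for (s, t), c in pair_count.items():
        scores.setdefault(s, {})[t] = 2 * c / (src_count[s] + tgt_count[t])
    return scores


# Toy sentence-aligned pairs (illustrative only).
pairs = [("en park", "a park"), ("ett piano", "a piano"), ("en dator", "a computer")]
lexicon = dice_lexicon(pairs)
best_for_park = max(lexicon["park"], key=lexicon["park"].get)
best_for_en = max(lexicon["en"], key=lexicon["en"].get)
print(best_for_park, best_for_en)
```

Keeping only the top-scoring target words for each source word yields a small translation lexicon. A real lexicon would need far more text and a frequency cutoff.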
Development of Statistical Stemmers
Swedish words: dator, datorn, datorer, datorersom, datornät, datornernä, diamanten, diamanterna, diamanter, diamant, informatik
Statistical English translations:
• computer, computers → computer
• diamond, diamonds → diamond
“computer” cluster: dator, datorn, datorer, datorersom, datornät, datornernä, informatik → dator
“diamond” cluster: diamanten, diamanterna, diamanter, diamant → diamant
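The clustering idea can be sketched as below. Several parts are assumptions made for illustration: each Swedish word is paired with a single best English translation, English variants are collapsed with a crude plural-stripping rule, and the shortest word in each cluster is used as its representative. The `toy_translations` table is hand-made, not output from a real translation lexicon.

```python
from collections import defaultdict


def normalize_english(word):
    """Crude stand-in for an English stemmer: strip a final plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_stemmer(best_translation):
    """Map each source word to the shortest word sharing its English translation."""
    clusters = defaultdict(list)
    for src, eng in best_translation.items():
        clusters[normalize_english(eng)].append(src)
    stem_of = {}
    for members in clusters.values():
        # Assumed choice: shortest member (ties broken alphabetically).
        representative = min(members, key=lambda w: (len(w), w))
        for w in members:
            stem_of[w] = representative
    return stem_of


# Hand-made best translations (illustrative only).
toy_translations = {
    "dator": "computer", "datorn": "computer", "datorer": "computers",
    "diamant": "diamond", "diamanten": "diamond",
    "diamanter": "diamonds", "diamanterna": "diamonds",
}
stemmer = build_stemmer(toy_translations)
print(stemmer["datorer"], stemmer["diamanterna"])
```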
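Fast Approximate Document Translation
1. Spanish documents → list of Spanish words
2. List of Spanish words → Spanish-English MT → list of English words
3. List of Spanish words + list of English words → bilingual Spanish-English wordlist
4. Spanish documents + bilingual wordlist → word-by-word English translations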
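A minimal sketch of this pipeline follows. The MT system is replaced by a hypothetical `translate_word` function backed by a three-entry toy table, and the tokenizer is a plain regular expression; both are assumptions for the example only.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_wordlist(documents, translate_word):
    """Translate each unique word once and store it in a bilingual wordlist."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {word: translate_word(word) for word in vocabulary}


def translate_document(document, wordlist):
    """Replace every word by its wordlist entry; unknown words pass through."""
    return " ".join(wordlist.get(tok, tok) for tok in tokenize(document))


# Hypothetical stand-in for the MT system: a tiny hand-made table.
toy_mt = {"dietas": "diets", "para": "for", "celíacos": "coeliacs"}


def translate_word(word):
    return toy_mt.get(word, word)


documents = ["Dietas para celíacos"]
wordlist = build_wordlist(documents, translate_word)
translated = translate_document(documents[0], wordlist)
print(translated)
```

The speed-up comes from calling the MT system once per unique word rather than once per document, so translating the collection reduces to dictionary lookups.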
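Query Translation-based Multilingual Retrieval
• The English query is translated (L&H) into French, German, and Spanish.
• Each query (English, French, German, Spanish) is run by IR against the documents in the same language.
• A merger combines the retrieved English docs, French docs, German docs, and Spanish docs into a combined ranked list of documents.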
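The merging step can be sketched as raw-score merging, the method listed in the evaluation tables: pool the per-language result lists and sort by the retrieval score as is. The document IDs and scores below are made up, and `merge_by_raw_score` is a name chosen for this example.

```python
def merge_by_raw_score(runs, top_k=1000):
    """Pool (doc_id, score) pairs from all runs and rank by unnormalized score."""
    merged = [(doc_id, score) for run in runs for doc_id, score in run]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:top_k]


# Made-up per-language result lists (illustrative only).
english_run = [("en-01", 0.91), ("en-02", 0.40)]
french_run = [("fr-01", 0.75), ("fr-02", 0.10)]
german_run = [("de-01", 0.55)]

combined = merge_by_raw_score([english_run, french_run, german_run], top_k=4)
print(combined)
```

Raw-score merging assumes that scores from different collections are comparable, which holds only roughly when each language is indexed and searched separately.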
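Document Translation-based Multilingual Retrieval
• The German, French, and Spanish documents are translated into English; the English documents are used as they are.
• The English query is run by a single IR pass against all the English-language documents.
• The result is a unified ranked list of documents, so no merging is needed.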
Evaluation of Multilingual Retrieval
Multilingual-4: English, TD
Run ID     Trans. method      Merging method  Average precision
bkmul4en1  query-trans        raw score       0.3783
bkmul4en2  doc-trans          none            0.4082
bkmul4en3  query & doc-trans  raw score       0.4260
Multilingual-8: English, TD
Run ID     Trans. method      Merging method  Average precision
bkmul8en1  query-trans        raw score       0.3317
bkmul8en2  doc-trans          none            0.3401
bkmul8en3  query & doc-trans  raw score       0.3733
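One plausible way to combine a query-translation run with a document-translation run is to add the raw scores of documents that appear in both lists. The slides give only "raw score" as the merging method, so the summation rule here is an assumption, and the document IDs and scores are made up.

```python
from collections import defaultdict


def combine_runs(query_trans_run, doc_trans_run):
    """Sum raw scores per document across the two runs, then re-rank."""
    totals = defaultdict(float)
    for run in (query_trans_run, doc_trans_run):
        for doc_id, score in run:
            totals[doc_id] += score  # a document missing from a run adds 0
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Made-up runs (illustrative only).
query_trans_run = [("d1", 0.5), ("d2", 0.25)]
doc_trans_run = [("d2", 0.5), ("d3", 0.125)]

combined_run = combine_runs(query_trans_run, doc_trans_run)
print(combined_run)
```

Under this rule a document retrieved by both approaches is promoted over one retrieved by only one of them.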
Query Translation vs. Document Translation
English words in topic 186: Diets for Celiacs
• Query translation: Las Dietas para Celiacs (Spanish); Nahrungen für Celiacs (German)
• Spanish doc words: celíacos dietas; document translation (word-by-word): celiacs diets (English)
• German doc words: diät zöliakie; document translation (word-by-word): diets coeliac diseases (English)
• Average precision: 0.0003 (mul4en1), 0.6750 (mul4en2)
English words in topic 161: Dutch Netherlands
• Query translation: Hollandais Hollande (French)
• French document words: Néerlandais Pays-Bas; document translation (word-by-word): Dutch Netherlands (English)
• Average precision: 0.2213 (mul4en1), 0.6167 (mul4en2)
Manual vs. Automatic Stemming
CLEF 2003 (topic fields: TD. No decompounding or query expansion)
Language  No stemming  Manual (Snowball)  Automatic (parallel texts)
Finnish   0.3801       0.4972             0.4304
Swedish   0.3630       0.4121             0.3844
CLEF 2001-2002 (topic fields: TD. No query expansion)
Language  No stemming  Manual (Muscat)  Automatic (L&H MT)
French    0.3905       0.4528           0.4521
Italian   0.3801       0.4324           0.4322
Spanish   0.4687       0.5166           0.5285
Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval
Topics (TD)        Dutch           German          Finnish         Swedish
baseline           .4342           .3727           .3801           .3630
decomp             .4744           .4294           .4204           .4331
stem               .4480           .4220           .4974           .4121
expan              .4673           .4867           .4071           .4224
decomp+stem        .4955           .5111           .4972           .4727
decomp+expan       .5126           .5473           .4469           .4880
stem+expan         .4962           .4804           .5541           .4838
decomp+stem+expan  .5304 (22.16%)  .5678 (52.35%)  .5633 (48.20%)  .5465 (50.55%)
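The percentages shown beside the decomp+stem+expan figures are consistent with relative gains over the baseline, (improved - baseline) / baseline. The short check below recomputes them from the baseline and decomp+stem+expan values on the slide; the function name `relative_gain` is chosen for this example.

```python
def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (baseline, decomp+stem+expan) average precision per topic language.
results = {
    "Dutch": (0.4342, 0.5304),
    "German": (0.3727, 0.5678),
    "Finnish": (0.3801, 0.5633),
    "Swedish": (0.3630, 0.5465),
}

gains = {lang: round(relative_gain(b, full), 2) for lang, (b, full) in results.items()}
for lang, gain in gains.items():
    print(f"{lang}: {gain:.2f}%")
```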
Conclusions
• Fast approximate document-translation worked well. Combining document-translation with query-translation was even better.
• Decompounding with stemming and query expansion worked well for languages with rich compounds.
• Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish and Swedish. But there is still room for improving statistical stemmers.
Software
The Berkeley Text Retrieval System is available for research purposes. Send requests to [email protected]
THANK YOU