Combining Query Translation and Document Translation in Cross-Language Retrieval Aitao Chen & Fredric C. Gey* School of Information Management and Systems *UC Data Archive & Technical Assistance University of California at Berkeley CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway


Page 1: Combining Query Translation and Document Translation in Cross-Language Retrieval

Combining Query Translation and Document Translation in Cross-Language Retrieval

Aitao Chen & Fredric C. Gey*

School of Information Management and Systems

*UC Data Archive & Technical Assistance

University of California at Berkeley

CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway

Page 2: Combining Query Translation and Document Translation in Cross-Language Retrieval

Talk Outline

• Development of new resources
• Fast approximate document translation
• Combining query translation and document translation
• Conclusions

Page 3: Combining Query Translation and Document Translation in Cross-Language Retrieval

New Resources

• Finnish and Swedish stoplists
• Base Finnish and Swedish lexicons for decompounding
• Statistical translation lexicons derived from parallel texts
• Finnish and Swedish statistical stemmers automatically generated from parallel texts
• English spelling normalizer

Page 4: Combining Query Translation and Document Translation in Cross-Language Retrieval

Development of Swedish Stoplist (by someone who doesn’t know Swedish)

In Swedish textbooks written in English (e.g., grammars), look for Swedish words whose English translations are English stopwords.

• en park (a park)
• ett piano (a piano)
• Jag vet inte mycket om honom (I don’t know much about him)
• efter skolan (after school)
• Hans och Greta (Hans and Greta)

(Source: Swedish: A comprehensive grammar by P. Holmes & I. Hinchliffe)
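The lookup described above can be sketched in a few lines. This is only an illustration: the word-level glosses in `GLOSSES` were aligned by hand from the example phrases, and `ENGLISH_STOPWORDS` is a tiny illustrative subset, not the stopword list actually used.

```python
# Tiny illustrative subset of an English stoplist (assumption, not the real list).
ENGLISH_STOPWORDS = {"a", "i", "not", "much", "about", "him", "after", "and"}

# Hypothetical hand-aligned (Swedish word, English gloss) pairs taken from
# the example phrases on the slide.
GLOSSES = [
    ("en", "a"), ("park", "park"),
    ("ett", "a"), ("piano", "piano"),
    ("jag", "i"), ("vet", "know"), ("inte", "not"), ("mycket", "much"),
    ("om", "about"), ("honom", "him"),
    ("efter", "after"), ("skolan", "school"),
    ("och", "and"),
]


def build_stoplist(glosses, english_stopwords):
    """Keep the Swedish words whose English gloss is an English stopword."""
    return sorted({sv for sv, en in glosses if en.lower() in english_stopwords})


swedish_stoplist = build_stoplist(GLOSSES, ENGLISH_STOPWORDS)
print(swedish_stoplist)
```

Content words such as "park" and "piano" are dropped because their glosses are not stopwords.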

Page 5: Combining Query Translation and Document Translation in Cross-Language Retrieval

Development of Swedish Base Lexicon

A base lexicon should contain all and only the words and their variants that are not compounds.

• Compile a list of Swedish words (e.g., from the Swedish document collection).

• Remove the words that are 4 or fewer characters long.

• Remove the long words that can be decomposed into short words in the initial wordlist.

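Example wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus

Decompositions:
datoranimation → dator + animation
datorgrafik → dator + grafik
datorteknologi → dator + teknologi
datorvirus → dator + virus

Remove the compounds that are decomposed.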
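The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a compound is taken to be any exact concatenation of two or more words from the filtered list, and linking elements between compound parts are not handled.

```python
def can_split(word, vocab):
    """True if `word` is a concatenation of two or more entries of `vocab`."""
    n = len(word)
    reachable = [False] * (n + 1)  # reachable[i]: word[:i] is a sequence of vocab words
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            whole_word = (start, end) == (0, n)  # skip the trivial one-part match
            if reachable[start] and not whole_word and word[start:end] in vocab:
                reachable[end] = True
                break
    return reachable[n]


def build_base_lexicon(words, min_len=5):
    """Drop words of 4 or fewer characters, then drop decomposable compounds."""
    candidates = {w for w in words if len(w) >= min_len}
    return {w for w in candidates if not can_split(w, candidates)}


wordlist = [
    "animation", "animationen", "dator", "datoranimation", "datorgrafik",
    "datorteknologi", "datorvirus", "grafik", "teknologi", "virus",
]
base_lexicon = build_base_lexicon(wordlist)
print(sorted(base_lexicon))
```

On the slide's example, the four "dator" compounds are removed, while the inflected form "animationen" stays because "en" is too short to be a lexicon entry.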

Page 6: Combining Query Translation and Document Translation in Cross-Language Retrieval

Development of Statistical Translation Lexicons from Parallel Texts

Pipeline: parallel texts (EU Official Journal) → PDF-to-text conversion → paragraph & sentence alignment → statistical MT toolkit / statistical association → statistical translation lexicons

Language pairs (first list):
1. English → Dutch
2. English → Finnish
3. English → Swedish
4. Dutch → English
5. Finnish → English
6. Swedish → English

Language pairs (second list):
1. Italian → Spanish
2. German → Italian
3. Finnish → German
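As a rough illustration of the "statistical association" step, the sketch below scores word pairs from sentence-aligned text with the Dice coefficient. The slide does not say which association measure was used, so Dice is an assumption here, and the three aligned sentence pairs are made-up toy data.

```python
from collections import Counter
from itertools import product


def dice_lexicon(aligned_pairs):
    """Map each source word to its best target word by Dice coefficient.

    Dice(s, t) = 2 * c(s, t) / (c(s) + c(t)), where counts are the number
    of aligned sentence pairs in which the words occur.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))

    lexicon = {}
    for (s, t), c in joint.items():
        score = 2 * c / (src_count[s] + tgt_count[t])
        if score > lexicon.get(s, ("", 0.0))[1]:
            lexicon[s] = (t, score)
    return lexicon


# Toy sentence-aligned Swedish-English data (hypothetical).
pairs = [
    ("en dator", "a computer"),
    ("en diamant", "a diamond"),
    ("dator och diamant", "computer and diamond"),
]
lexicon = dice_lexicon(pairs)
print(lexicon["dator"], lexicon["diamant"])
```

A real lexicon would be built from many thousands of aligned sentences and would usually keep several scored candidates per word rather than only the best one.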

Page 7: Combining Query Translation and Document Translation in Cross-Language Retrieval

Development of Statistical Stemmers

Swedish words: dator, datorn, datorer, datorersom, datornät, datornernä, diamanten, diamanterna, diamanter, diamant, informatik

Statistical English translations: computer, computers → computer; diamond, diamonds → diamond

“computer” cluster: dator, datorn, datorer, datorersom, datornät, datornernä, informatik → dator

“diamond” cluster: diamanten, diamanterna, diamanter, diamant → diamant
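The clustering idea can be sketched as below. Several parts are assumptions made for illustration: the `TRANSLATIONS` table is made up, `english_stem` is a crude plural stripper standing in for a real English stemmer, and choosing the shortest cluster member as the stem is only one plausible rule that happens to yield "dator" and "diamant".

```python
from collections import defaultdict


def english_stem(word):
    """Crude English plural stripping (stand-in for a real stemmer)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_stemmer(translations):
    """Cluster source words by stemmed English translation; map each to a stem."""
    clusters = defaultdict(list)
    for source_word, english in translations.items():
        clusters[english_stem(english)].append(source_word)

    stem_of = {}
    for members in clusters.values():
        # Assumption: use the shortest member (ties broken alphabetically).
        stem = min(members, key=lambda w: (len(w), w))
        for member in members:
            stem_of[member] = stem
    return stem_of


# Hypothetical most-likely English translation for each Swedish word.
TRANSLATIONS = {
    "dator": "computer", "datorn": "computer", "datorer": "computers",
    "informatik": "computer",
    "diamant": "diamond", "diamanten": "diamond",
    "diamanter": "diamonds", "diamanterna": "diamonds",
}
stem_of = build_stemmer(TRANSLATIONS)
print(stem_of["datorer"], stem_of["diamanterna"])
```

Note that "informatik" lands in the "computer" cluster, as on the slide: a translation-based stemmer groups words by meaning, not only by shared prefix.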

Page 8: Combining Query Translation and Document Translation in Cross-Language Retrieval

Fast Approximate Document Translation

Spanish documents → (1) list of Spanish words → (2) Spanish-English MT → list of English words → (3) bilingual Spanish-English wordlist → (4) word-by-word English translations
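A minimal sketch of this flow, assuming the MT system can be wrapped as a function that translates one word at a time. `toy_translate_word` and its dictionary are hypothetical stand-ins for the real Spanish-English MT system.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def build_wordlist(documents, translate_word):
    """Translate each unique word once to get a bilingual wordlist."""
    vocabulary = sorted({token for doc in documents for token in tokenize(doc)})
    return {word: translate_word(word) for word in vocabulary}


def translate_document(document, wordlist):
    """Word-by-word translation; unknown words are passed through unchanged."""
    return " ".join(wordlist.get(token, token) for token in tokenize(document))


# Hypothetical stand-in for the MT system.
TOY_MT = {"las": "the", "dietas": "diets", "para": "for", "celíacos": "celiacs"}


def toy_translate_word(word):
    return TOY_MT.get(word, word)


docs = ["Las dietas para celíacos", "Dietas"]
wordlist = build_wordlist(docs, toy_translate_word)
translated = [translate_document(doc, wordlist) for doc in docs]
print(translated)
```

The speed-up comes from calling the MT system once per distinct word instead of once per document; the price is that word order and context are ignored.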

Page 9: Combining Query Translation and Document Translation in Cross-Language Retrieval

Query Translation-based Multilingual Retrieval

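[Diagram: the English query is translated by L&H into French, German and Spanish. Each version of the query (English, French, German, Spanish) is run by its own IR engine against the documents in the same language (English docs, French docs, German docs, Spanish docs). A merger combines the four result lists into a combined ranked list of documents.]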
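The merger step can be illustrated with raw-score merging, the merging method named in the evaluation table for the query-translation runs. The sketch below simply pools the per-language result lists and sorts by retrieval score; the document IDs and scores are made up.

```python
def merge_by_raw_score(runs, k=1000):
    """Pool (doc_id, score) pairs from all runs and keep the top k by score."""
    pooled = [pair for run in runs for pair in run]
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical per-language result lists of (doc_id, raw retrieval score).
english_run = [("EN-1", 0.91), ("EN-2", 0.40)]
french_run = [("FR-1", 0.75), ("FR-2", 0.10)]
german_run = [("DE-1", 0.83)]
spanish_run = [("ES-1", 0.55)]

combined = merge_by_raw_score([english_run, french_run, german_run, spanish_run], k=4)
print(combined)
```

Raw-score merging assumes that scores from different collections are comparable, which is not guaranteed when each language is indexed and searched separately.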

Page 10: Combining Query Translation and Document Translation in Cross-Language Retrieval

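Document Translation-based Multilingual Retrieval

[Diagram: the German, French and Spanish documents are translated into English and pooled with the English documents. A single IR engine runs the English query against this pooled English collection and returns a unified ranked list of documents.]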

Page 11: Combining Query Translation and Document Translation in Cross-Language Retrieval

Evaluation of Multilingual Retrieval

Multilingual-4: English, TD

Run ID      Trans. method       Merging method   Average precision
bkmul4en1   query-trans         raw score        0.3783
bkmul4en2   doc-trans           none             0.4082
bkmul4en3   query & doc-trans   raw score        0.4260

Multilingual-8: English, TD

Run ID      Trans. method       Merging method   Average precision
bkmul8en1   query-trans         raw score        0.3317
bkmul8en2   doc-trans           none             0.3401
bkmul8en3   query & doc-trans   raw score        0.3733
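For readers unfamiliar with the metric, the sketch below computes uninterpolated average precision for a single topic. It assumes the usual TREC/CLEF definition, with the reported figure being the mean over all topics; the ranked list and relevance judgments are made up.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero, because
    the sum is divided by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(per_topic):
    """Average the per-topic values; `per_topic` is a list of (ranking, relevant)."""
    return sum(average_precision(r, rel) for r, rel in per_topic) / len(per_topic)


# Hypothetical topic: relevant documents are retrieved at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(ap)
```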

Page 12: Combining Query Translation and Document Translation in Cross-Language Retrieval

Query Translation vs. Document Translation

Topic 186
English words in topic: Diets for Celiacs
Query translation: Las Dietas para Celiacs (Spanish); Nahrungen für Celiacs (German)
Document words: celíacos dietas (Spanish); diät zöliakie (German)
Document translation (word-by-word) into English: celiacs diets (from Spanish); diets coeliac diseases (from German)
Average precision: 0.0003 (mul4en1) vs. 0.6750 (mul4en2)

Topic 161
English words in topic: Dutch Netherlands
Query translation: Hollandais Hollande (French)
French document words: Néerlandais Pays-Bas
Document translation (word-by-word): Dutch Netherlands (English)
Average precision: 0.2213 (mul4en1) vs. 0.6167 (mul4en2)
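The topic 186 example can be made concrete by counting which query terms actually occur among the document words in each setting. This is only a toy term-overlap check on the strings from the slide, not the retrieval model used in the runs.

```python
def overlap(query, document_words):
    """Lowercased terms shared by the query and the document words."""
    return set(query.lower().split()) & set(document_words.lower().split())


# Query translation: translated query vs. original-language document words.
spanish_match = overlap("Las Dietas para Celiacs", "celíacos dietas")
german_match = overlap("Nahrungen für Celiacs", "diät zöliakie")

# Document translation: English query vs. word-by-word translated documents.
from_spanish = overlap("Diets for Celiacs", "celiacs diets")
from_german = overlap("Diets for Celiacs", "diets coeliac diseases")

print(spanish_match, german_match, from_spanish, from_german)
```

The untranslated "Celiacs" never matches "celíacos" or "zöliakie", and the German query shares no term at all with the German document words, which is consistent with the very low average precision of the query-translation run on this topic.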

Page 13: Combining Query Translation and Document Translation in Cross-Language Retrieval

Manual vs. Automatic Stemming

CLEF 2003 (topic fields: TD. No decompounding or query expansion)

Language   No stemming   Manual (Snowball)   Automatic (parallel texts)
Finnish    0.3801        0.4972              0.4304
Swedish    0.3630        0.4121              0.3844

CLEF 2001-2002 (topic fields: TD. No query expansion)

Language   No stemming   Manual (Muscat)   Automatic (L&H MT)
French     0.3905        0.4528            0.4521
Italian    0.3801        0.4324            0.4322
Spanish    0.4687        0.5166            0.5285

Page 14: Combining Query Translation and Document Translation in Cross-Language Retrieval

Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval

Topics (TD)         Dutch            German           Finnish          Swedish
baseline            .4342            .3727            .3801            .3630
decomp+stem+expan   .5304 (22.16%)   .5678 (52.35%)   .5633 (48.20%)   .5465 (50.55%)

Values charted for the remaining configurations (decomp, stem, expan, decomp+stem, decomp+expan, stem+expan), in source order; the configuration and language of each value are not recoverable here:

.4744 .4294 .4204 .4331 .4480 .4220 .4974 .4121 .4673 .4867 .4071 .4224 .4955 .5111 .4972 .4727 .5126 .5473 .4469 .4880 .4962 .4804 .5541 .4838
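The percentages in parentheses are consistent with the relative improvement of the decomp+stem+expan run over the baseline, as the short check below shows. The pairing of values with languages follows the order Dutch, German, Finnish, Swedish.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (baseline, decomp+stem+expan) average precision per language.
scores = {
    "Dutch": (0.4342, 0.5304),
    "German": (0.3727, 0.5678),
    "Finnish": (0.3801, 0.5633),
    "Swedish": (0.3630, 0.5465),
}
gains = {lang: relative_improvement(base, full) for lang, (base, full) in scores.items()}

for lang, gain in gains.items():
    print(f"{lang}: {gain:.2f}%")
```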

Page 15: Combining Query Translation and Document Translation in Cross-Language Retrieval

Conclusions

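• Fast approximate document translation worked well (average precision 0.4082 vs. 0.3783 for query translation on Multilingual-4). Combining document translation with query translation was even better (0.4260).

• Decompounding combined with stemming and query expansion worked well for compound-rich languages, improving on the baseline by 22% to 52% for Dutch, German, Finnish and Swedish.

• Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish (0.4304 vs. 0.4972) and Swedish (0.3844 vs. 0.4121), but there is still room for improving statistical stemmers.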

Page 16: Combining Query Translation and Document Translation in Cross-Language Retrieval

Software

The Berkeley Text Retrieval System is available for research purposes. Send requests to [email protected]

Page 17: Combining Query Translation and Document Translation in Cross-Language Retrieval

THANK YOU