CLEF Workshop, ECDL 2003, Trondheim, 21.-22.08.2003

Introduction to the Monolingual and Domain-Specific Tasks of the Cross-language Evaluation Forum 2003

Michael Kluck (Informationszentrum Sozialwissenschaften – IZ, Bonn/Berlin; Humboldt-University Berlin)

kluck@bonn.iz-soz.de


Monolingual Task

• Languages:
  – Dutch, Finnish, French, German, Italian, Spanish, Swedish
  – New: Russian (with a reduced topic set, because of the time span of the data)
  – Exclusion of English (widely used in TREC etc., overflow of runs; open to newcomers only)
• Aim:
  – Building a starting point for CLIR
  – Enlarging and balancing the pool
  – Using recently introduced or new languages in the CLEF campaign


Monolingual runs by 22 participants

Lang.  Delivered runs  Judged runs    %
DE                 28           18   64
EN                 11            3   27
ES                 40           11   28
FI                 16           16  100
FR                 35           17   49
IT                 24            8   33
NL                 32           15   47
RU                 22           22  100
SV                 18           18  100
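
The percentage column of the run table above can be reproduced from the two run counts. The following is a minimal Python sketch, assuming the percentage is judged runs divided by delivered runs, rounded half up to a whole number. The rounding rule, the `RUNS` dictionary, and the `judged_share` helper are illustrative assumptions; the slides do not state how the column was computed.

```python
# (delivered runs, judged runs) per language, copied from the run table above.
RUNS = {
    "DE": (28, 18),
    "EN": (11, 3),
    "ES": (40, 11),
    "FI": (16, 16),
    "FR": (35, 17),
    "IT": (24, 8),
    "NL": (32, 15),
    "RU": (22, 22),
    "SV": (18, 18),
}


def judged_share(delivered: int, judged: int) -> int:
    """Judged runs as a percentage of delivered runs, rounded half up.

    Integer arithmetic avoids floating-point ties. The rounding rule is an
    assumption, since the slides do not state one.
    """
    return (200 * judged + delivered) // (2 * delivered)


shares = {lang: judged_share(d, j) for lang, (d, j) in RUNS.items()}

for lang, share in shares.items():
    delivered, judged = RUNS[lang]
    print(f"{lang}: {judged}/{delivered} runs judged = {share}%")
```

Under this assumption the computed shares agree with the percentages on the slide for all nine languages.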


Domain-Specific Task

• Amaryllis
  – Could not be continued because of a lack of funding in France
  – An attempt to obtain social science data from INIST failed
• GIRT
  – New, bigger corpus GIRT4 in German, drawn from social science literature and current research information
  – Parallel corpus in English, although with a smaller amount of text than the German part


Features of GIRT4

• Bigger than GIRT3, now 302,638 documents:
  – 151,319 original German
  – 151,319 translated into English
• Pseudo-parallel corpus:
  – Title, Controlled-Term, and Classification-Text available in German and English for all documents
  – Abstract available for 96% of documents in German, but only for 15% in English -> reduced amount of text for the English part
  – Translated texts (Abstract) are sometimes the result of machine translation by SYSTRAN (EU)
  – Renumbered


Field Availability in GIRT4

• Equal distribution for the German and English parts:
  – Title: 1 per doc
  – Controlled-Terms: 10.15 per doc on average
  – Classification-Text: 2.02 per doc on average
• Different distribution for the German and English parts (on average):
  – Method-Term: DE 2.35 per doc, EN 1.93 per doc
  – Abstract: DE 0.96 per doc, EN 0.15 per doc
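
To give a feel for what the abstract averages mean in absolute terms, they can be scaled by the size of each language part. This is a rough Python sketch, assuming 151,319 documents per language part (the figure given under Features of GIRT4). The results are estimates only, because the per-document averages on the slide are rounded to two decimals; the names `DOCS_PER_LANGUAGE`, `ABSTRACTS_PER_DOC`, and `estimated_abstracts` are illustrative.

```python
DOCS_PER_LANGUAGE = 151_319  # documents in each language part, from the slides

# Average number of abstracts per document, from the slides.
ABSTRACTS_PER_DOC = {"DE": 0.96, "EN": 0.15}


def estimated_abstracts(avg_per_doc: float, docs: int = DOCS_PER_LANGUAGE) -> int:
    """Estimate the absolute number of abstracts from a per-document average."""
    return round(avg_per_doc * docs)


estimates = {lang: estimated_abstracts(avg) for lang, avg in ABSTRACTS_PER_DOC.items()}

# How many times more abstracts the German part has than the English part.
ratio = ABSTRACTS_PER_DOC["DE"] / ABSTRACTS_PER_DOC["EN"]

for lang, count in estimates.items():
    print(f"{lang}: about {count} abstracts")
print(f"DE/EN ratio: about {ratio:.1f}")
```

On these figures the German part has roughly six times as many abstracts as the English part, which is the reduced amount of English text the slides refer to.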


GIRT4 Tasks

• Monolingual
  – DE topics -> DE data
  – EN topics -> EN data
• Bilingual
  – EN or RU topics -> DE data
  – DE or RU topics -> EN data
• Additional instruments
  – German-English thesaurus
  – German-Russian translation table (not fully up to date)
• Concordance list of document numbers
  – Will be available by the end of August 2003


Assessment of GIRT4

• 17,031 docs, +65 %
• Started with the German part
• Then identified the identical English documents (if they had been indicated as relevant hits)
• Continued with those hits in the English part that had been indicated as relevant but had no counterparts in the German part
• During assessment it turned out that the search results in the two language parts were not fully congruent
  – For a given topic, the result hits in the English part were not identical with those in the German part (the assessors did not know which hit belonged to which run)
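
The assessment order described above can be sketched as three batches per topic. This is only an illustration in Python under stated assumptions: the function `assessment_batches`, the document identifiers, and the shape of the concordance (a mapping from a German document number to its English counterpart) are hypothetical and do not reflect the actual GIRT4 tooling or numbering.

```python
def assessment_batches(de_hits, en_hits, concordance):
    """Split one topic's pooled hits into the three batches described above.

    de_hits, en_hits: sets of pooled document ids per language part.
    concordance: hypothetical mapping, German id -> English counterpart id.
    """
    # 1. The German part is assessed first.
    german_batch = sorted(de_hits)

    # 2. English counterparts of the German hits, if they were also returned.
    counterparts = {concordance[d] for d in de_hits if d in concordance}
    paired_batch = sorted(counterparts & set(en_hits))

    # 3. English hits whose German counterpart was not among the German hits.
    english_only_batch = sorted(set(en_hits) - counterparts)

    return german_batch, paired_batch, english_only_batch


# Hypothetical ids, for illustration only.
concordance = {"de-1": "en-1", "de-2": "en-2", "de-3": "en-3"}
batches = assessment_batches({"de-1", "de-2"}, {"en-2", "en-3"}, concordance)
print(batches)
```

In this toy example the third batch is non-empty, which mirrors the observation that the English and German result sets for a topic were not fully congruent.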


GIRT4 runs by 4 participants

Type         Data      Topic lang.  Judged runs
Monolingual  GIRT4 DE  DE                    13
Monolingual  GIRT4 EN  EN                     4
Bilingual    GIRT4 DE  EN                     1
Bilingual    GIRT4 DE  RU                     2
Bilingual    GIRT4 EN  DE                     1
Bilingual    GIRT4 EN  RU                     1

Totals: 17 monolingual runs, 5 bilingual runs
