Introduction to the Monolingual and Domain-Specific Tasks of the Cross-language Evaluation Forum 2003
Michael Kluck (Informationszentrum Sozialwissenschaften – IZ, Bonn/Berlin, Humboldt-University Berlin) [email protected]
CLEF Workshop, ECDL 2003, Trondheim, 21.-22.08.2003

Monolingual Task

• Languages:
  – Dutch, Finnish, French, German, Italian, Spanish, Swedish
  – New: Russian (with a reduced topic set because of the time span of the data)
  – English excluded (widely used in TREC etc., overflow of runs; only newcomers)
• Aim:
  – Build a starting point for CLIR
  – Enlarge and balance the pool
  – Use recently introduced or new languages in the CLEF campaign

Monolingual runs by 22 participants

Lang.  Delivered runs  Judged runs    %
DE                 28           18   64
EN                 11            3   27
ES                 40           11   28
FI                 16           16  100
FR                 35           17   49
IT                 24            8   33
NL                 32           15   47
RU                 22           22  100
SV                 18           18  100
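
The percentage column is the number of judged runs divided by the number of delivered runs, rounded to a whole number. A minimal Python sketch that recomputes it from the counts on this slide; the names `RUNS` and `judged_share` are illustrative only and not part of any CLEF tooling:

```python
# Delivered and judged monolingual runs per language, copied from the slide.
RUNS = {
    "DE": (28, 18),
    "EN": (11, 3),
    "ES": (40, 11),
    "FI": (16, 16),
    "FR": (35, 17),
    "IT": (24, 8),
    "NL": (32, 15),
    "RU": (22, 22),
    "SV": (18, 18),
}


def judged_share(delivered: int, judged: int) -> int:
    """Return judged/delivered as a whole-number percentage (round half up)."""
    # Pure integer arithmetic: floor(100 * judged / delivered + 0.5).
    return (200 * judged + delivered) // (2 * delivered)


for lang, (delivered, judged) in RUNS.items():
    print(f"{lang}: {judged}/{delivered} = {judged_share(delivered, judged)} %")
```

Recomputed this way, every language matches the percentage given on the slide.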

Domain-Specific Task

• Amaryllis
  – Could not be continued because of a lack of funding in France
  – An attempt to obtain social science data from INIST failed
• GIRT
  – New, bigger corpus GIRT4 in German, drawn from social science literature and current research information
  – Parallel corpus in English, although with a smaller amount of text than the German part

Features of GIRT4

• Bigger than GIRT3, now 302,638 documents:
  – 151,319 original German
  – 151,319 translated into English
• Pseudo-parallel corpus:
  – Title, Controlled-Term, Classification-Text available in German and English for all documents
  – Abstract available for 96 % of documents in German but only for 15 % in English -> reduced amount of text for the English part
  – Translated texts (Abstract) are sometimes the result of machine translation by SYSTRAN (EU)
  – Renumbered

Field Availability in GIRT4

• Equal distribution for the German and English parts:
  – Title: 1 per doc
  – Controlled-Terms: 10.15 per doc on average
  – Classification-Text: 2.02 per doc on average
• Different distribution for the German and English parts (on average):
  – Method-Term: DE 2.35 per doc, EN 1.93 per doc
  – Abstract: DE 0.96 per doc, EN 0.15 per doc
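
To get a feel for what these per-document averages mean at corpus scale, they can be multiplied by the 151,319 documents in each language part. The following Python sketch does that; the names `DOCS_PER_PART`, `AVERAGE_PER_DOC` and `estimated_total` are illustrative, and because the averages on the slide are rounded, the results are rough estimates rather than official GIRT4 counts:

```python
# Documents in each language part of GIRT4, as stated on the slides.
DOCS_PER_PART = 151_319

# Average field occurrences per document, copied from the slide.
AVERAGE_PER_DOC = {
    "Title": {"DE": 1.0, "EN": 1.0},
    "Controlled-Term": {"DE": 10.15, "EN": 10.15},
    "Classification-Text": {"DE": 2.02, "EN": 2.02},
    "Method-Term": {"DE": 2.35, "EN": 1.93},
    "Abstract": {"DE": 0.96, "EN": 0.15},
}


def estimated_total(field: str, part: str) -> int:
    """Estimate how often a field occurs in one language part."""
    # The slide averages are rounded, so this is only an approximation.
    return round(AVERAGE_PER_DOC[field][part] * DOCS_PER_PART)


for field in AVERAGE_PER_DOC:
    de = estimated_total(field, "DE")
    en = estimated_total(field, "EN")
    print(f"{field}: DE ~{de:,}, EN ~{en:,}")
```

On these rounded figures the German part holds roughly 145,000 abstracts and the English part roughly 23,000, which illustrates why the English part offers much less text to search.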

GIRT4 Tasks

• Monolingual
  – DE topics -> DE data
  – EN topics -> EN data
• Bilingual
  – EN or RU topics -> DE data
  – DE or RU topics -> EN data
• Additional instruments
  – German-English thesaurus
  – German-Russian translation table (not fully up-to-date)
• Concordance list of document numbers
  – Will be available by end of August 2003
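
One simple way a participant might use a bilingual thesaurus for the bilingual task is term-by-term lookup of the topic terms. The Python sketch below illustrates that idea only; it is not the method of any CLEF participant, the function `translate_topic` is hypothetical, and the three thesaurus entries are invented for the example rather than taken from the GIRT thesaurus:

```python
# Toy German-English lookup table. These entries are invented for
# illustration and are NOT taken from the actual GIRT thesaurus.
THESAURUS_DE_EN = {
    "arbeitslosigkeit": "unemployment",
    "jugend": "youth",
    "migration": "migration",
}


def translate_topic(terms, thesaurus):
    """Look up each topic term; return (translated, untranslated) lists."""
    translated = []
    untranslated = []
    for term in terms:
        key = term.lower()  # the toy table is keyed by lower-case terms
        if key in thesaurus:
            translated.append(thesaurus[key])
        else:
            untranslated.append(term)  # keep for manual or fallback handling
    return translated, untranslated


english_terms, missing = translate_topic(
    ["Jugend", "Arbeitslosigkeit", "Ostdeutschland"], THESAURUS_DE_EN
)
print(english_terms, missing)
```

Terms that the table does not cover are returned separately, so a real system could fall back on another resource for them.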

Assessment of GIRT4

• 17,031 docs, +65 %
• Started with the German part
• Then identified the identical English documents (if they had been indicated as relevant hits)
• Continued with those hits in the English part that had been indicated as relevant (without having counterparts in the German part)
• During assessment it turned out that the search results in the two language parts were not fully congruent
  – For a given topic, the hits in the English part were not identical with those in the German part (assessed without knowing which hit belonged to which run)
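
The congruence check described above can be sketched as a set comparison per topic. The Python example below assumes that a mapping from English to German document numbers is available; that mapping, the function `compare_hits` and the document numbers used are all hypothetical and only serve to illustrate the idea:

```python
def compare_hits(german_hits, english_hits, en_to_de):
    """Split the hits for one topic into both / German-only / English-only."""
    german = set(german_hits)
    both = set()
    english_only = set()
    for en_doc in english_hits:
        # Look up the German counterpart; None if no counterpart is known.
        de_doc = en_to_de.get(en_doc)
        if de_doc in german:
            both.add(de_doc)
        else:
            english_only.add(en_doc)
    german_only = german - both
    return both, german_only, english_only


# Invented document numbers, for illustration only.
en_to_de = {"EN-1": "DE-1", "EN-2": "DE-2", "EN-3": "DE-3"}
both, german_only, english_only = compare_hits(
    ["DE-1", "DE-2"], ["EN-2", "EN-3"], en_to_de
)
print(sorted(both), sorted(german_only), sorted(english_only))
```

The two parts are fully congruent for a topic only if both the German-only and the English-only sets are empty.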

GIRT4 runs by 4 participants

Task         Data      Topic lang.  Judged runs
Monolingual  GIRT4 DE  DE                    13
Monolingual  GIRT4 EN  EN                     4
Bilingual    GIRT4 DE  EN                     1
Bilingual    GIRT4 DE  RU                     2
Bilingual    GIRT4 EN  DE                     1
Bilingual    GIRT4 EN  RU                     1

Totals: 17 monolingual runs, 5 bilingual runs