CLEF Workshop, ECDL 2003, Trondheim
21-22 August 2003
Michael Kluck
Introduction to the Monolingual and Domain-Specific Tasks of the Cross-Language Evaluation Forum 2003
Michael Kluck (Informationszentrum Sozialwissenschaften – IZ, Bonn/Berlin; Humboldt-University Berlin)
Monolingual Task
• Languages:
– Dutch, Finnish, French, German, Italian, Spanish, Swedish
– New: Russian (with a reduced topic set because of the time span of the data)
– English excluded (widely used in TREC etc., overflow of runs); open to newcomers only
• Aim:
– Build a starting point for CLIR
– Enlarge and balance the pool
– Use recently introduced or new languages in the CLEF campaign
Monolingual runs by 22 participants
Lang.   Delivered runs   Judged runs   %
DE      28               18            64
EN      11               3             27
ES      40               11            28
FI      16               16            100
FR      35               17            49
IT      24               8             33
NL      32               15            47
RU      22               22            100
SV      18               18            100
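The percentage column can be re-derived from the delivered and judged run counts. The following is a minimal Python sketch of that arithmetic; the names `RUNS`, `judged_share` and `judged_shares` are illustrative, and rounding half up to a whole percent is an assumption about how the slide's figures were produced.

```python
# (delivered runs, judged runs) per language, copied from the table above.
RUNS = {
    "DE": (28, 18),
    "EN": (11, 3),
    "ES": (40, 11),
    "FI": (16, 16),
    "FR": (35, 17),
    "IT": (24, 8),
    "NL": (32, 15),
    "RU": (22, 22),
    "SV": (18, 18),
}


def judged_share(delivered: int, judged: int) -> int:
    """Judged runs as a whole-number percentage of delivered runs.

    Integer arithmetic equivalent to floor(100 * judged / delivered + 0.5),
    i.e. rounding half up (an assumed convention).
    """
    return (200 * judged + delivered) // (2 * delivered)


def judged_shares(runs: dict) -> dict:
    """Map each language code to its judged-run percentage."""
    return {lang: judged_share(d, j) for lang, (d, j) in runs.items()}


if __name__ == "__main__":
    for lang, pct in judged_shares(RUNS).items():
        delivered, judged = RUNS[lang]
        print(f"{lang}: {judged}/{delivered} judged = {pct}%")
```

Under this rounding assumption the computed values agree with the percentages listed for all nine languages.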
Domain-Specific Task
• Amaryllis
– Could not be continued because of lack of funding in France
– Attempts to obtain social science data from INIST failed
• GIRT
– New, bigger corpus GIRT4 in German, drawn from social science literature and current research information
– Parallel corpus in English, although with a smaller amount of text than the German part
Features of GIRT4
• Bigger than GIRT3, now 302,638 documents:
– 151,319 original German
– 151,319 translated into English
• Pseudo-parallel corpus:
– Title, Controlled-Term, Classification-Text available in German and English for all documents
– Abstract available for 96 % in German, but only for 15 % in English -> reduced amount of text for the English part
– Translated texts (Abstract) are sometimes the result of machine translation by SYSTRAN (EU)
– Documents renumbered
Field Availability in GIRT4
• Equal distribution for the German and English parts:
– Title: 1 per doc
– Controlled-Terms: 10.15 per doc on average
– Classification-Text: 2.02 per doc on average
• Different distribution for the German and English parts (on average):
– Method-Term: DE 2.35 per doc, EN 1.93 per doc
– Abstract: DE 0.96 per doc, EN 0.15 per doc
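The abstract averages can be turned into approximate document counts. This is a rough sketch, assuming 151,319 documents per language part (the GIRT4 figure given in this talk) and at most one abstract per document, so that the per-document average acts as a coverage rate. The names `DOCS_PER_PART`, `ABSTRACTS_PER_DOC` and `estimated_abstracts` are illustrative, and the results are estimates rather than figures reported in the talk.

```python
# Documents in each language part of GIRT4, as stated in the talk.
DOCS_PER_PART = 151_319

# Average number of abstracts per document, from the field availability figures.
ABSTRACTS_PER_DOC = {"DE": 0.96, "EN": 0.15}


def estimated_abstracts(per_doc: float, docs: int = DOCS_PER_PART) -> int:
    """Approximate number of documents with an abstract.

    Assumes at most one abstract per document, so the average per document
    can be read as the share of documents that have one.
    """
    return round(per_doc * docs)


if __name__ == "__main__":
    for lang, per_doc in ABSTRACTS_PER_DOC.items():
        print(f"{lang}: about {estimated_abstracts(per_doc):,} abstracts")
```

On these assumptions the German part has roughly 145,000 abstracts against roughly 23,000 in the English part, which illustrates how much less running text the English side offers.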
GIRT4 Tasks
• Monolingual
– DE topics -> DE data
– EN topics -> EN data
• Bilingual
– EN or RU topics -> DE data
– DE or RU topics -> EN data
• Additional instruments
– German-English thesaurus
– German-Russian translation table (not fully up to date)
• Concordance list of document numbers
– Will be available by the end of August 2003
Assessment of GIRT4
• 17,031 docs, +65 %
• Started with the German part
• Then identified the identical English documents (if they had been indicated as relevant hits)
• Continued with those hits in the English part that had been indicated as relevant (without having counterparts in the German part)
• During assessment it turned out that the search results in the two language parts were not fully congruent
– For a given topic, the result hits in the English part were not identical with those in the German part (without knowing which hit belonged to which run)
GIRT4 runs by 4 participants
Data       Topic lang.   Judged runs   Task (total)
GIRT4 DE   DE            13            Monolingual (17)
GIRT4 EN   EN            4             Monolingual
GIRT4 DE   EN            1             Bilingual (5)
GIRT4 DE   RU            2             Bilingual
GIRT4 EN   DE            1             Bilingual
GIRT4 EN   RU            1             Bilingual
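The monolingual and bilingual subtotals follow from the individual rows: a run counts as monolingual when the topic language matches the data language, and as bilingual otherwise. A minimal Python sketch of that tally, with illustrative names (`GIRT4_RUNS`, `task_type`, `runs_per_task`) that are not part of the talk:

```python
from collections import defaultdict

# (data language, topic language, judged runs), copied from the table above.
GIRT4_RUNS = [
    ("DE", "DE", 13),
    ("EN", "EN", 4),
    ("DE", "EN", 1),
    ("DE", "RU", 2),
    ("EN", "DE", 1),
    ("EN", "RU", 1),
]


def task_type(data_lang: str, topic_lang: str) -> str:
    """Classify a run by comparing the topic language with the data language."""
    return "monolingual" if data_lang == topic_lang else "bilingual"


def runs_per_task(rows) -> dict:
    """Sum the judged runs for each task type."""
    totals = defaultdict(int)
    for data_lang, topic_lang, judged in rows:
        totals[task_type(data_lang, topic_lang)] += judged
    return dict(totals)


if __name__ == "__main__":
    for task, total in runs_per_task(GIRT4_RUNS).items():
        print(f"{task}: {total} judged runs")
```

The tally gives 17 monolingual and 5 bilingual runs, matching the subtotals shown on the slide.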