Automatic translation quality control using Eurovoc descriptors

Marko Tadić, Božo Bekavac (marko.tadic@ffzg.hr, http://www.hnk.ffzg.hr/mt/; bbekavac@ffzg.hr, http://www.hnk.ffzg.hr/bb/)

Department of linguistics / Institute of linguistics, Faculty of Philosophy, University of Zagreb (www.ffzg.hr, www.hnk.ffzg.hr)

JRC Ispra / Arona, 2005-09-27

Talk plan

motivation

automatic translation quality control

resources: glossary and test corpus

results

further directions

Motivation

Acquis communautaire (AC) is still being translated into Croatian

former Ministry of European Integrations (MEI), today Ministry of External Affairs and European Integrations

AC

– ca 200,000 pages of EU OJ

– AC corpus: from 8 Mw (Estonian) to 82 Mw (Spanish)

– not precisely delimited (lawyers are working on that!)

– constantly growing

– legal texts

• a lot of repetitious and formulaic expressions

• low polysemy in terms expected

Motivation 2

different EU accession candidates, different organization of the translation process

– several years of work

– large number of translators

– in-house / outsourced (tenders)

– large-scale document translation and revision

MEI

– outsourcing to ca 100 translators or translation companies

– use of a glossary with pre-established translation equivalents (TEs)

– glossaries being translated in advance

• Eurovoc

• EU Law Glossary / Čtyřjazyčný slovník práva Evropské unie, Prague 1999

maintain the consistency of translation

– by use of the same glossary only?

Preparing AC originals for translation

project proposed by our Institute to MEI in 2002

entries from glossary marked in original text before translation

signal of the existence of pre-established TE to the translator

obligatory usage of existing TE in legal texts, e.g.:

– Council of Europe = Vijeće Europe

– European Council = Europsko vijeće

– Council of the European Union = Vijeće Europske unije

...

AC had to be converted to XML

MEI dropped the project in 2003 for lack of funding

now: AC corpus in XML

Revision of Translation

largest effort was put into translation in all candidate countries

revision of translation always in the last place

– quality: consistency

task undermined by all candidate countries

– large portions of official translation of AC poorly revised

usually done

– manually

– simple search & replace commands

– no terms/entries marked in texts

automatic approach?

– lexical level and idiomatic level

Automatic Translation Quality Control

use a system to check whether all pre-established TEs are used

– sentence-aligned parallel corpus

– glossary entries marked in

• original text

• translated text

if the TE of a glossary entry found in the original has not been found in the aligned translated sentence, the translation is departing from the pre-established TE

e.g.:

– Eurovoc: (en) President of the Commission = (hr) Predsjednik Komisije

– Corpus: (en) … if the President of the Commission declares …

(hr1) … ako Predsjednik Komisije objavi …

(hr2) … ako Predsjednik objavi …
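The check described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' system: the one-entry glossary and the function name `departures` are assumptions made for the example, and it uses plain case-insensitive substring matching, which ignores inflection.

```python
# Toy glossary: English entry -> pre-established Croatian TE (taken from the slide's example).
GLOSSARY = {
    "President of the Commission": "Predsjednik Komisije",
}

def departures(aligned_pairs, glossary=GLOSSARY):
    """Yield (pair_index, en_term, expected_te) for every glossary entry that
    occurs in the English sentence while its TE is missing from the aligned
    Croatian sentence."""
    for i, (en, hr) in enumerate(aligned_pairs):
        for en_term, hr_te in glossary.items():
            if en_term.lower() in en.lower() and hr_te.lower() not in hr.lower():
                yield i, en_term, hr_te

# The two aligned variants from the slide.
pairs = [
    ("... if the President of the Commission declares ...",
     "... ako Predsjednik Komisije objavi ..."),   # hr1: TE used
    ("... if the President of the Commission declares ...",
     "... ako Predsjednik objavi ..."),            # hr2: TE shortened
]

flagged = list(departures(pairs))
print(flagged)  # [(1, 'President of the Commission', 'Predsjednik Komisije')]
```

Only the second pair (hr2) is flagged, because the full TE is absent from the translated sentence.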

Resources

our lexicon: Eurovoc 4.1

– documentational indexing glossary

– ca 6000 entries (descriptors) covering topics found in EU legal texts

– accompanied by non-descriptors (synonyms)

– translated into Croatian in 2000

– + 4000 Croatia-specific descriptors

– translation always 1:1

– combination of nouns, adjectives, prepositions, conjunctions

our corpus: 9 documents from AC corpus and their translations from MEI

– size: 16,053 tokens (en), 13,590 tokens (hr)

– Croatian translations converted to AC corpus XML format

Method

simple glossary look-up?

problem of inflection

– English

• at least: sg, pl, ’s

– Croatian

• 7 cases × 2 numbers for nouns

• 7 cases × 2 numbers × 3 genders × 2 definiteness values × 3 degrees of comparison for adjectives

lemmatization of corpus or glossary?

Eurovoc lemmatized and converted to an FSA (finite-state automaton) with Intex

– English lemmatizer from Intex for English Eurovoc

– Croatian Lemmatization Server (hml.ffzg.hr) for Croatian Eurovoc

– FSA with 10430 states
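Why lemmatization matters can be shown with a small Python sketch. It is not the Intex FSA: the lemma table `LEMMAS` is a hand-made, hypothetical stand-in for a real lemmatizer, and the example sentence is invented. A glossary entry is stored as a sequence of lemmas and looked up in the lemmatized sentence, so inflected forms still match.

```python
# Hypothetical lemma table; a real system would query a lemmatizer instead.
LEMMAS = {
    "diskrecijsko": "diskrecijski",
    "diskrecijskog": "diskrecijski",
    "pravo": "pravo",
    "prava": "pravo",
}

def lemmatize(tokens, lemmas=LEMMAS):
    """Map each token to its lemma; unknown tokens fall back to their lowercase form."""
    return [lemmas.get(t.lower(), t.lower()) for t in tokens]

def find_entry(sentence_tokens, entry_lemmas):
    """Return the start indices where the lemma sequence of a glossary entry occurs."""
    sent = lemmatize(sentence_tokens)
    n = len(entry_lemmas)
    return [i for i in range(len(sent) - n + 1) if sent[i:i + n] == entry_lemmas]

entry = ["diskrecijski", "pravo"]  # lemmatized form of "diskrecijsko pravo"
tokens = "u okviru diskrecijskog prava Komisije".split()  # invented sentence, genitive forms

hits = find_entry(tokens, entry)
print(hits)  # [2]
```

A plain string search for "diskrecijsko pravo" would miss this sentence, because both words appear in the genitive.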

Eurovoc as FSA

"<diskrecijski><pravo>/<EVD lang=\"hr\" id=\"001444\">" 320 300 1 2
"<pravo><imenovanje>/<EVD lang=\"hr\" id=\"003048\">" 320 300 1 2
"<pravo><nadzor>/<EVD lang=\"hr\" id=\"003646\">" 320 300 1 2
"<nadzoran><tijelo>/<EVD lang=\"hr\" id=\"005492\">" 320 300 1 2
"<pravo><odlučivanje>/<EVD lang=\"hr\" id=\"003043\">" 320 300 1 2
"<pravo><poticanje>/<EVD lang=\"hr\" id=\"003045\">" 320 300 1 2
"<pravo><pregovaranje>/<EVD lang=\"hr\" id=\"003049\">" 320 300 1 2
"<pravo><procjena>/<EVD lang=\"hr\" id=\"003042\">" 320 300 1 2
"<pravo><provedba>/<EVD lang=\"hr\" id=\"003044\">" 320 300 1 2
"<pravo><ratifikacija>/<EVD lang=\"hr\" id=\"003046\">" 320 300 1 2
"<savjetodavan><pravo>/<EVD lang=\"hr\" id=\"000717\">" 320 300 1 2
"<veto>/<EVD lang=\"hr\" id=\"003964\">" 320 300 1 2
"<politički><sustav>/<EVD lang=\"hr\" id=\"000153\">" 320 300 1 2
"<autoritaran><režim>/<EVD lang=\"hr\" id=\"000849\">" 320 300 1 2
"<diktatura>/<EVD lang=\"hr\" id=\"001428\">" 320 300 1 2
"<dvostranački><sustav>/<EVD lang=\"hr\" id=\"003851\">" 320 300 1 2
"<federalizam>/<EVD lang=\"hr\" id=\"001830\">" 320 300 1 2
"<jednostranački><sustav>/<EVD lang=\"hr\" id=\"002835\">" 320 300 1 2
"<monokracija>/<EVD lang=\"hr\" id=\"002659\">" 320 300 1 2
"<narodan><demokracija>/<EVD lang=\"hr\" id=\"002933\">" 320 300 1 2
"<oligarhija>/<EVD lang=\"hr\" id=\"033057\">" 320 300 1 2
"<parlamentaran><sustav>/<EVD lang=\"hr\" id=\"002903\">" 320 300 1 2
"<pobunjenički><vlada>/<EVD lang=\"hr\" id=\"003233\">" 320 300 1 2
"<predsjednički><sustav>/<EVD lang=\"hr\" id=\"034211\">" 320 300 1 2
"<promjena><politički><sustav>/<EVD lang=\"hr\" id=\"001128\">" 320 300 1 2
"<republika>/<EVD lang=\"hr\" id=\"003302\">" 320 300 1 2
"<ustavan><monarhija>/<EVD lang=\"hr\" id=\"001254\">" 320 300 1 2
"<višestranački><sustav>/<EVD lang=\"hr\" id=\"002676\">" 320 300 1 2
"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2
"<vojni><režim>/<EVD lang=\"hr\" id=\"002617\">" 320 300 1 2
"<politički><stranka>/<EVD lang=\"hr\" id=\"000003\">" 320 300 1 2
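Each entry in the listing pairs a sequence of lemmas with the `<EVD>` tag to be output for it. As a rough sketch, such a line could be read back into a lemma list, language and descriptor ID with a regular expression. The pattern below is inferred only from the lines shown on this slide, assuming the backslashes are literal characters in the file, and it ignores the four trailing numbers, whose meaning is not stated here. The names `ENTRY_RE` and `parse_entry` are illustrative.

```python
import re

# Matches: "<lemma1><lemma2>.../<EVD lang=\"xx\" id=\"NNNNNN\">"
# (the \" sequences are literal backslash + quote, as in the listing).
ENTRY_RE = re.compile(r'"((?:<[^<>/]+>)+)/<EVD lang=\\"(\w+)\\" id=\\"(\d+)\\">"')

def parse_entry(line):
    """Return the lemmas, language and descriptor ID of one entry line, or None."""
    m = ENTRY_RE.search(line)
    if m is None:
        return None
    lemmas = re.findall(r"<([^<>]+)>", m.group(1))
    return {"lemmas": lemmas, "lang": m.group(2), "id": m.group(3)}

sample = r'"<vlada><u><progonstvo>/<EVD lang=\"hr\" id=\"002028\">" 320 300 1 2'
parsed = parse_entry(sample)
print(parsed)  # {'lemmas': ['vlada', 'u', 'progonstvo'], 'lang': 'hr', 'id': '002028'}
```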

Method 2

glossary entries marked in corpus together with IDs

– <EVD lang="hr" id="001747">

checking whether the same ID appears on both sides of alignment (Perl script)

P1/005749
P1/005494
P2/005749
P2/005494
P7/002840
P8/001952
P8/001952
P9/000060
...

statistics                en      hr
<P>                       652     656
<EVD>                     1328    1484
<EVD> with matched IDs    803 (60.47% of en <EVD>s)

matched <EVD>s are also word/phrase-aligned parts below <P>
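The ID comparison can be sketched in Python as follows. The original was a Perl script whose exact logic is not given on the slide, so this is an assumption: IDs are compared per aligned paragraph as multisets, so that a descriptor occurring twice must be matched twice. The function names and the toy paragraph data are invented for the example; only the final ratio uses the counts reported above.

```python
from collections import Counter

def matched_ids(en_ids, hr_ids):
    """Multiset intersection of descriptor IDs for one aligned paragraph pair."""
    return list((Counter(en_ids) & Counter(hr_ids)).elements())

def unmatched_en_ids(en_ids, hr_ids):
    """IDs marked in the English paragraph with no counterpart in the Croatian one."""
    return list((Counter(en_ids) - Counter(hr_ids)).elements())

# Invented toy data: paragraph number -> IDs of the <EVD> elements found in it.
en = {1: ["005749", "005494"], 2: ["005749", "005494", "002840"]}
hr = {1: ["005749", "005494"], 2: ["005749"]}

for p in sorted(en):
    for evd_id in matched_ids(en[p], hr.get(p, [])):
        print(f"P{p}/{evd_id}")          # same paragraph/ID layout as the output above

print(unmatched_en_ids(en[2], hr[2]))    # candidates for a departure from the TE

# Share of English <EVD>s with a matched ID, from the reported counts.
share = 803 / 1328
print(f"{share:.2%}")                    # 60.47%
```

The unmatched IDs are the interesting output for revision: each one points to a paragraph where a glossary entry was found in the original but its pre-established TE was not found in the translation.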

Drawbacks

syntactic merging

– abbreviations not matched / marked (e.g. EP delegation vs. European Parliament delegation)

– merged terms not matched / marked (e.g. head of State, head of government vs. heads of State or Government)

EUROVOC = glossary intended for indexing

– a lot of real terms (MWU) not matched / marked because they don’t exist as entries (e.g. country candidate to EU accession, Stabilisation and Association Agreement)

– no semantic processing: polysemous terms wrongly matched / marked (e.g. ...which might lead (olovo) to a common defence..., where the verb lead is marked as the metal, hr olovo)

Intex English lemmatizer didn’t cover all Eurovoc entries

Further directions

evaluation of matched pairs of <EVD> regarding

– single-word units

– multi-word units

improving Intex English lemmatizer / lexicon

use Eurovoc non-descriptors as synonyms

– to capture a wider departure from expected TE in translation more precisely

use / include other glossaries

– EU Law Glossary

test the whole system on a larger corpus

use it with other languages
